Preface



Here are the course lecture notes for the course MAS 108, Probability I, at Queen 
Mary, University of London, taken by most Mathematics students and some others 
in the first semester. 

The description of the course is as follows: 

This course introduces the basic notions of probability theory and de- 
velops them to the stage where one can begin to use probabilistic 
ideas in statistical inference and modelling, and the study of stochastic 
processes. Probability axioms. Conditional probability and indepen- 
dence. Discrete random variables and their distributions. Continuous 
distributions. Joint distributions. Independence. Expectations. Mean, 
variance, covariance, correlation. Limiting distributions.

The syllabus is as follows: 

1. Basic notions of probability. Sample spaces, events, relative frequency, 
probability axioms. 

2. Finite sample spaces. Methods of enumeration. Combinatorial probability. 

3. Conditional probability. Theorem of total probability. Bayes' theorem.

4. Independence of two events. Mutual independence of n events. Sampling 
with and without replacement. 

5. Random variables. Univariate distributions - discrete, continuous, mixed. 
Standard distributions - hypergeometric, binomial, geometric, Poisson, uni- 
form, normal, exponential. Probability mass function, density function, dis- 
tribution function. Probabilities of events in terms of random variables. 

6. Transformations of a single random variable. Mean, variance, median, 
quantiles. 

7. Joint distribution of two random variables. Marginal and conditional distri-
butions. Independence.

8. Covariance, correlation. Means and variances of linear functions of random
variables. 

9. Limiting distributions in the Binomial case. 

These course notes explain the material in the syllabus. They have been "field-
tested" on the class of 2000. Many of the examples are taken from the course
homework sheets or past exam papers. 

Set books The notes cover only material in the Probability I course. The text- 
books listed below will be useful for other courses on probability and statistics. 
You need at most one of the three textbooks listed below, but you will need the 
statistical tables. 

• Probability and Statistics for Engineering and the Sciences by Jay L. De- 
vore (fifth edition), published by Wadsworth. 

Chapters 2-5 of this book are very close to the material in the notes, both in 
order and notation. However, the lectures go into more detail at several points, 
especially proofs. If you find the course difficult then you are advised to buy 
this book, read the corresponding sections straight after the lectures, and do extra 
exercises from it. 

Other books which you can use instead are: 

• Probability and Statistics in Engineering and Management Science by W. W. 
Hines and D. C. Montgomery, published by Wiley, Chapters 2-8. 

• Mathematical Statistics and Data Analysis by John A. Rice, published by 
Wadsworth, Chapters 1-4. 

You should also buy a copy of 

• New Cambridge Statistical Tables by D. V. Lindley and W. F. Scott, pub- 
lished by Cambridge University Press. 

You need to become familiar with the tables in this book, which will be provided 
for you in examinations. All of these books will also be useful to you in the 
courses Statistics I and Statistical Inference. 

The next book is not compulsory but introduces the ideas in a friendly way: 

• Taking Chances: Winning with Probability, by John Haigh, published by 
Oxford University Press.

Web resources Course material for the MAS 108 course is kept on the Web at
the address 

http://www.maths.qmw.ac.uk/~pjc/MAS108/

This includes a preliminary version of these notes, together with coursework 
sheets, test and past exam papers, and some solutions. 
Other web pages of interest include 

http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/pdf.html

A textbook Introduction to Probability, by Charles M. Grinstead and J. Laurie 
Snell, available free, with many exercises. 

http://www.math.uah.edu/stat/

The Virtual Laboratories in Probability and Statistics, a set of web-based resources 
for students and teachers of probability and statistics, where you can run simula- 
tions etc. 

http://www.newton.cam.ac.uk/wmy2kposters/july/

The Birthday Paradox (poster in the London Underground, July 2000). 

http://www.combinatorics.org/Surveys/ds5/VennEJC.html

An article on Venn diagrams by Frank Ruskey, with history and many nice pic- 
tures. 

Web pages for other Queen Mary maths courses can be found from the on-line 
version of the Maths Undergraduate Handbook. 

Peter J. Cameron 
December 2000

Chapter 1
Basic ideas 



In this chapter, we don't really answer the question 'What is probability?' No-
body has a really good answer to this question. We take a mathematical approach,
writing down some basic axioms which probability must satisfy, and making de-
ductions from these. We also look at different kinds of sampling, and examine
what it means for events to be independent. 

1.1 Sample space, events 

The general setting is: We perform an experiment which can have a number of 
different outcomes. The sample space is the set of all possible outcomes of the 
experiment. We usually call it S. 

It is important to be able to list the outcomes clearly. For example, if I plant 
ten bean seeds and count the number that germinate, the sample space is 

S = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.

If I toss a coin three times and record the result, the sample space is 

S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT},

where (for example) HTH means 'heads on the first toss, then tails, then heads 
again'. 

Sometimes we can assume that all the outcomes are equally likely. (Don't 
assume this unless either you are told to, or there is some physical reason for 
assuming it. In the beans example, it is most unlikely. In the coins example, 
the assumption will hold if the coin is 'fair': this means that there is no physical 
reason for it to favour one side over the other.) If all outcomes are equally likely, 
then each has probability 1/|S|. (Remember that |S| is the number of elements in
the set S.)

On this point, Albert Einstein wrote, in his 1905 paper On a heuristic point
of view concerning the production and transformation of light (for which he was 
awarded the Nobel Prize), 

In calculating entropy by molecular-theoretic methods, the word "prob- 
ability" is often used in a sense differing from the way the word is 
defined in probability theory. In particular, "cases of equal probabil- 
ity" are often hypothetically stipulated when the theoretical methods 
employed are definite enough to permit a deduction rather than a stip- 
ulation. 

In other words: Don't just assume that all outcomes are equally likely, especially 
when you are given enough information to calculate their probabilities! 

An event is a subset of S. We can specify an event by listing all the outcomes 
that make it up. In the above example, let A be the event 'more heads than tails' 
and B the event 'heads on last throw'. Then 

A = {HHH, HHT, HTH, THH},
B = {HHH, HTH, THH, TTH}.

The probability of an event is calculated by adding up the probabilities of all 
the outcomes comprising that event. So, if all outcomes are equally likely, we 
have

$$P(A) = \frac{|A|}{|S|}.$$

In our example, both A and B have probability 4/8 = 1/2. 
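
The counting above can be checked by brute force. The following is a minimal Python sketch, assuming a fair coin so that all eight outcomes are equally likely; the helper name `prob` is our own and is not part of the notes.

```python
from fractions import Fraction
from itertools import product

# Sample space for three tosses: every string of H/T of length 3.
S = {"".join(t) for t in product("HT", repeat=3)}

# Events are subsets of the sample space.
A = {s for s in S if s.count("H") > s.count("T")}  # more heads than tails
B = {s for s in S if s[-1] == "H"}                 # heads on last throw


def prob(event, space):
    """P(event) = |event| / |space|; valid only for equally likely outcomes."""
    return Fraction(len(event), len(space))


print(sorted(A), prob(A, S))
print(sorted(B), prob(B, S))
```

Using exact fractions rather than floating-point numbers keeps the answers in the same form as the hand calculation.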

An event is simple if it consists of just a single outcome, and is compound 
otherwise. In the example, A and B are compound events, while the event 'heads 
on every throw' is simple (as a set, it is {HHH}). If A = {a} is a simple event, 
then the probability of A is just the probability of the outcome a, and we usually
write P(a), which is simpler to write than P({a}). (Note that a is an outcome,
while {a} is an event, indeed a simple event.) 

We can build new events from old ones: 

• $A \cup B$ (read 'A union B') consists of all the outcomes in A or in B (or both!);

• $A \cap B$ (read 'A intersection B') consists of all the outcomes in both A and B;

• $A \setminus B$ (read 'A minus B') consists of all the outcomes in A but not in B;

• $A'$ (read 'A complement') consists of all outcomes not in A (that is, $S \setminus A$);

• $\emptyset$ (read 'empty set') for the event which doesn't contain any outcomes.

Note the backward-sloping slash; this is not the same as either a vertical slash | or
a forward slash /.

In the example, A' is the event 'more tails than heads', and $A \cap B$ is the event
{HHH, THH, HTH}. Note that $P(A \cap B) = 3/8$; this is not equal to $P(A) \cdot P(B)$,
despite what you read in some books!

1.2 What is probability? 

There is really no answer to this question. 

Some people think of it as 'limiting frequency'. That is, to say that the proba-
bility of getting heads when a coin is tossed is 1/2 means that, if the coin is tossed many
times, it is likely to come down heads about half the time. But if you toss a coin
1000 times, you are not likely to get exactly 500 heads. You wouldn't be surprised 
to get only 495. But what about 450, or 100? 

Some people would say that you can work out probability by physical argu- 
ments, like the one we used for a fair coin. But this argument doesn't work in all 
cases, and it doesn't explain what probability means.

Some people say it is subjective. You say that the probability of heads in a 
coin toss is 1/2 because you have no reason for thinking either heads or tails more 
likely; you might change your view if you knew that the owner of the coin was a 
magician or a con man. But we can't build a theory on something subjective. 

We regard probability as a mathematical construction satisfying some axioms 
(devised by the Russian mathematician A. N. Kolmogorov). We develop ways of 
doing calculations with probability, so that (for example) we can calculate how 
unlikely it is to get 480 or fewer heads in 1000 tosses of a fair coin. The answer 
agrees well with experiment. 
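
As a rough illustration of the kind of calculation meant here, the sketch below computes the chance of 480 or fewer heads exactly. It assumes a fair coin and that all $2^{1000}$ sequences of tosses are equally likely, and it counts sequences with a given number of heads using binomial coefficients, a technique which is only developed later in these notes.

```python
from fractions import Fraction
from math import comb

n = 1000  # number of tosses

# The number of sequences of n tosses containing exactly k heads is C(n, k).
favourable = sum(comb(n, k) for k in range(0, 481))

# All 2**n sequences are assumed equally likely (fair coin).
p = Fraction(favourable, 2**n)
print(float(p))
```

The result is a probability of roughly one in ten, so 480 or fewer heads is uncommon but far from astonishing.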

1.3 Kolmogorov's Axioms 

Remember that an event is a subset of the sample space S. A number of events,
say $A_1, A_2, \ldots$, are called mutually disjoint or pairwise disjoint if $A_i \cap A_j = \emptyset$ for
any two of the events $A_i$ and $A_j$; that is, no two of the events overlap.

According to Kolmogorov's axioms, each event A has a probability P(A),
which is a number. These numbers satisfy three axioms:

Axiom 1: For any event A, we have $P(A) \ge 0$.

Axiom 2: P(S) = 1.

Axiom 3: If the events $A_1, A_2, \ldots$ are pairwise disjoint, then

$$P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots$$

Note that in Axiom 3, we have the union of events and the sum of numbers.
Don't mix these up; never write $P(A_1) \cup P(A_2)$, for example. Sometimes we sep-
arate Axiom 3 into two parts: Axiom 3a if there are only finitely many events
$A_1, A_2, \ldots, A_n$, so that we have

$$P(A_1 \cup \cdots \cup A_n) = \sum_{i=1}^{n} P(A_i),$$

and Axiom 3b for infinitely many. We will only use Axiom 3a, but 3b is important
later on.

Notice that we write

$$\sum_{i=1}^{n} P(A_i)$$

for

$$P(A_1) + P(A_2) + \cdots + P(A_n).$$

1.4 Proving things from the axioms 

You can prove simple properties of probability from the axioms. That means, 
every step must be justified by appealing to an axiom. These properties seem 
obvious, just as obvious as the axioms; but the point of this game is that we assume 
only the axioms, and build everything else from that. 

Here are some examples of things proved from the axioms. There is really no 
difference between a theorem, a proposition, and a corollary; they all have to be 
proved. Usually, a theorem is a big, important statement; a proposition a rather 
smaller statement; and a corollary is something that follows quite easily from a 
theorem or proposition that came before. 

Proposition 1.1 If the event A contains only a finite number of outcomes, say
$A = \{a_1, a_2, \ldots, a_n\}$, then

$$P(A) = P(a_1) + P(a_2) + \cdots + P(a_n).$$

To prove the proposition, we define a new event $A_i$ containing only the out-
come $a_i$, that is, $A_i = \{a_i\}$, for $i = 1, \ldots, n$. Then $A_1, \ldots, A_n$ are mutually disjoint
(each contains only one element which is in none of the others), and $A_1 \cup A_2 \cup
\cdots \cup A_n = A$; so by Axiom 3a, we have

$$P(A) = P(a_1) + P(a_2) + \cdots + P(a_n).$$

Corollary 1.2 If the sample space S is finite, say $S = \{a_1, \ldots, a_n\}$, then

$$P(a_1) + P(a_2) + \cdots + P(a_n) = 1.$$

For $P(a_1) + P(a_2) + \cdots + P(a_n) = P(S)$ by Proposition 1.1, and P(S) = 1 by
Axiom 2. Notice that once we have proved something, we can use it on the same
basis as an axiom to prove further facts.

Now we see that, if all the n outcomes are equally likely, and their probabil-
ities sum to 1, then each has probability 1/n, that is, 1/|S|. Now going back to
Proposition 1.1, we see that, if all outcomes are equally likely, then

$$P(A) = \frac{|A|}{|S|}$$

for any event A, justifying the principle we used earlier.

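Proposition 1.3 $P(A') = 1 - P(A)$ for any event A.

Let $A_1 = A$ and $A_2 = A'$ (the complement of A). Then $A_1 \cap A_2 = \emptyset$ (that is, the
events $A_1$ and $A_2$ are disjoint), and $A_1 \cup A_2 = S$. So

$$\begin{aligned}
P(A_1) + P(A_2) &= P(A_1 \cup A_2) && \text{(Axiom 3)} \\
&= P(S) \\
&= 1 && \text{(Axiom 2).}
\end{aligned}$$

So $P(A) = P(A_1) = 1 - P(A_2) = 1 - P(A')$, that is, $P(A') = 1 - P(A)$.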

Corollary 1.4 $P(A) \le 1$ for any event A.

For $1 - P(A) = P(A')$ by Proposition 1.3, and $P(A') \ge 0$ by Axiom 1; so
$1 - P(A) \ge 0$, from which we get $P(A) \le 1$.

Remember that if you ever calculate a probability to be less than 0 or more
than 1, you have made a mistake!

Corollary 1.5 $P(\emptyset) = 0$.

For $\emptyset = S'$, so $P(\emptyset) = 1 - P(S)$ by Proposition 1.3; and P(S) = 1 by Axiom 2,
so $P(\emptyset) = 0$.

Here is another result. The notation $A \subseteq B$ means that A is contained in B, that
is, every outcome in A also belongs to B.

Proposition 1.6 If $A \subseteq B$, then $P(A) \le P(B)$.

This time, take $A_1 = A$, $A_2 = B \setminus A$. Again we have $A_1 \cap A_2 = \emptyset$ (since the
elements of $B \setminus A$ are, by definition, not in A), and $A_1 \cup A_2 = B$. So by Axiom 3,

$$P(A_1) + P(A_2) = P(A_1 \cup A_2) = P(B).$$

In other words, $P(A) + P(B \setminus A) = P(B)$. Now $P(B \setminus A) \ge 0$ by Axiom 1; so

$$P(A) \le P(B),$$

as we had to show.

1.5 Inclusion-Exclusion Principle 




A Venn diagram for two sets A and B suggests that, to find the size of $A \cup B$,
we add the size of A and the size of B, but then we have included the size of $A \cap B$
twice, so we have to take it off. In terms of probability:

Proposition 1.7

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

We now prove this from the axioms, using the Venn diagram as a guide. We
see that $A \cup B$ is made up of three parts, namely

$$A_1 = A \cap B, \quad A_2 = A \setminus B, \quad A_3 = B \setminus A.$$

Indeed we do have $A \cup B = A_1 \cup A_2 \cup A_3$, since anything in $A \cup B$ is in both these
sets or just the first or just the second. Similarly we have $A_1 \cup A_2 = A$ and $A_1 \cup A_3 = B$.

The sets $A_1, A_2, A_3$ are mutually disjoint. (We have three pairs of sets to check.
Now $A_1 \cap A_2 = \emptyset$, since all elements of $A_1$ belong to B but no elements of $A_2$ do.
The arguments for the other two pairs are similar - you should do them yourself.)
So, by Axiom 3, we have

$$\begin{aligned}
P(A) &= P(A_1) + P(A_2), \\
P(B) &= P(A_1) + P(A_3), \\
P(A \cup B) &= P(A_1) + P(A_2) + P(A_3).
\end{aligned}$$

From this we obtain

$$\begin{aligned}
P(A) + P(B) - P(A \cap B) &= (P(A_1) + P(A_2)) + (P(A_1) + P(A_3)) - P(A_1) \\
&= P(A_1) + P(A_2) + P(A_3) \\
&= P(A \cup B),
\end{aligned}$$

as required.

The Inclusion-Exclusion Principle extends to more than two events, but gets
more complicated. Here it is for three events; try to prove it yourself.

To calculate $P(A \cup B \cup C)$, we first add up P(A), P(B), and P(C). The parts in
common have been counted twice, so we subtract $P(A \cap B)$, $P(A \cap C)$ and $P(B \cap C)$.
But then we find that the outcomes lying in all three sets have been taken off
completely, so must be put back, that is, we add $P(A \cap B \cap C)$.

Proposition 1.8 For any three events A, B, C, we have

$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C).$$

Can you extend this to any number of events?
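
A numerical check is no substitute for a proof, but it is a useful sanity test. The sketch below verifies Propositions 1.7 and 1.8 on the three-coin sample space of Section 1.1, assuming equally likely outcomes; the third event C ('tails on first throw') is our own illustrative choice and does not appear in the notes.

```python
from fractions import Fraction
from itertools import product

# Three tosses of a fair coin: eight equally likely outcomes.
S = {"".join(t) for t in product("HT", repeat=3)}


def prob(event):
    """P(event) for equally likely outcomes in S."""
    return Fraction(len(event), len(S))


A = {s for s in S if s.count("H") > s.count("T")}  # more heads than tails
B = {s for s in S if s[-1] == "H"}                 # heads on last throw
C = {s for s in S if s[0] == "T"}                  # tails on first throw (illustrative)

# Proposition 1.7: two events. Python's | and & are set union and intersection.
lhs2 = prob(A | B)
rhs2 = prob(A) + prob(B) - prob(A & B)

# Proposition 1.8: three events.
lhs3 = prob(A | B | C)
rhs3 = (prob(A) + prob(B) + prob(C)
        - prob(A & B) - prob(A & C) - prob(B & C)
        + prob(A & B & C))

print(lhs2, rhs2)
print(lhs3, rhs3)
```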

1.6 Other results about sets 

There are other standard results about sets which are often useful in probability 
theory. Here are some examples. 

Proposition 1.9 Let A, B, C be subsets of S.
Distributive laws: $(A \cap B) \cup C = (A \cup C) \cap (B \cup C)$ and

$(A \cup B) \cap C = (A \cap C) \cup (B \cap C)$.

De Morgan's Laws: $(A \cup B)' = A' \cap B'$ and $(A \cap B)' = A' \cup B'$.

We will not give formal proofs of these. You should draw Venn diagrams and 
convince yourself that they work. 
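
Besides drawing Venn diagrams, you can test these laws exhaustively on a small example. The following sketch checks all four identities for every choice of A, B, C among the subsets of a four-element set; the set S = {0, 1, 2, 3} and the helper names are assumptions made purely for illustration, and a check on one small set is of course not a proof.

```python
from itertools import combinations

S = frozenset(range(4))  # a small illustrative sample space


def subsets(universe):
    """Yield every subset of the given finite set."""
    items = list(universe)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def complement(X):
    """The complement X' of X within S."""
    return S - X


all_subsets = list(subsets(S))

laws_hold = all(
    (A & B) | C == (A | C) & (B | C)                        # first distributive law
    and (A | B) & C == (A & C) | (B & C)                    # second distributive law
    and complement(A | B) == complement(A) & complement(B)  # De Morgan
    and complement(A & B) == complement(A) | complement(B)  # De Morgan
    for A in all_subsets
    for B in all_subsets
    for C in all_subsets
)
print(laws_hold)
```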



1.7 Sampling

I have four pens in my desk drawer; they are red, green, blue, and purple. I draw a
pen; each pen has the same chance of being selected. In this case, S = {R, G, B, P},
where R means 'red pen chosen' and so on. In this case, if A is the event 'red or
green pen chosen', then

$$P(A) = \frac{|A|}{|S|} = \frac{2}{4} = \frac{1}{2}.$$

More generally, if I have a set of n objects and choose one, with each one 
equally likely to be chosen, then each of the n outcomes has probability l/n, and 
an event consisting of m of the outcomes has probability m/n. 

What if we choose more than one pen? We have to be more careful to specify 
the sample space. 

First, we have to say whether we are 

• sampling with replacement, or 

• sampling without replacement. 

Sampling with replacement means that we choose a pen, note its colour, put 
it back and shake the drawer, then choose a pen again (which may be the same 
pen as before or a different one), and so on until the required number of pens have 
been chosen. If we choose two pens with replacement, the sample space is 

{RR, RG, RB, RP, 

GR, GG, GB, GP, 

BR, BG, BB, BP, 

PR, PG, PB, PP} 

The event 'at least one red pen' is {RR,RG,RB,RP,GR,BR,PR}, and has proba- 
bility 7/16. 

Sampling without replacement means that we choose a pen but do not put it 
back, so that our final selection cannot include two pens of the same colour. In 
this case, the sample space for choosing two pens is 

{ RG, RB, RP, 

GR, GB, GP, 

BR, BG, BP, 

PR, PG, PB } 



and the event 'at least one red pen' is {RG, RB, RP, GR, BR, PR}, with probability
6/12 = 1/2.

Now there is another issue, depending on whether we care about the order in
which the pens are chosen. We will only consider this in the case of sampling 
without replacement. It doesn't really matter in this case whether we choose the 
pens one at a time or simply take two pens out of the drawer; and we are not 
interested in which pen was chosen first. So in this case the sample space is 

{{R,G},{R,B},{R,P},{G,B},{G,P},{B,P}}, 

containing six elements. (Each element is written as a set since, in a set, we don't 
care which element is first, only which elements are actually present. So the sam- 
ple space is a set of sets!) The event 'at least one red pen' is {{R, G}, {R,B}, {R,P}}, 
with probability 3/6 = 1/2. We should not be surprised that this is the same as in
the previous case. 
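
The three sample spaces for choosing two pens can be generated mechanically. The sketch below uses Python's `itertools` to list them and computes the probability of 'at least one red pen' in each case, assuming all outcomes in each sample space are equally likely; the function name is our own.

```python
from fractions import Fraction
from itertools import combinations, permutations, product

pens = "RGBP"  # red, green, blue, purple


def prob_at_least_one_red(space):
    """Fraction of outcomes in the sample space that include the red pen."""
    space = list(space)
    favourable = [s for s in space if "R" in s]
    return Fraction(len(favourable), len(space))


# Ordered, with replacement: 16 outcomes.
with_repl = prob_at_least_one_red(product(pens, repeat=2))
# Ordered, without replacement: 12 outcomes.
ordered = prob_at_least_one_red(permutations(pens, 2))
# Unordered, without replacement: 6 outcomes.
unordered = prob_at_least_one_red(combinations(pens, 2))

print(with_repl, ordered, unordered)
```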

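There are formulae for the sample space size in these three cases. These in-
volve the following functions:

$$\begin{aligned}
n! &= n(n-1)(n-2) \cdots 1, \\
{}^{n}P_{k} &= n(n-1)(n-2) \cdots (n-k+1), \\
{}^{n}C_{k} &= {}^{n}P_{k} / k!.
\end{aligned}$$

Note that n! is the product of all the whole numbers from 1 to n; and

$${}^{n}P_{k} = \frac{n!}{(n-k)!},$$

so that

$${}^{n}C_{k} = \frac{n!}{k!\,(n-k)!}.$$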

Theorem 1.10 The number of selections of k objects from a set of n objects is
given in the following table.

                     with replacement     without replacement
ordered sample       $n^k$                ${}^{n}P_{k}$
unordered sample                          ${}^{n}C_{k}$

In fact the number that goes in the empty box is ${}^{n+k-1}C_{k}$, but this is much
harder to prove than the others, and you are very unlikely to need it.

Here are the proofs of the other three cases. First, for sampling with replace- 
ment and ordered sample, there are n choices for the first object, and n choices 
for the second, and so on; we multiply the choices for different objects. (Think of 
the choices as being described by a branching tree.) The product of k factors each 
equal to n is $n^k$.




For sampling without replacement and ordered sample, there are still n choices
for the first object, but now only $n - 1$ choices for the second (since we do not
replace the first), and $n - 2$ for the third, and so on; there are $n - k + 1$ choices for
the kth object, since $k - 1$ have previously been removed and $n - (k - 1)$ remain.
As before, we multiply. This product is the formula for ${}^{n}P_{k}$.

For sampling without replacement and unordered sample, think first of choos-
ing an ordered sample, which we can do in ${}^{n}P_{k}$ ways. But each unordered sample
could be obtained by drawing it in k! different orders. So we divide by k!, obtain-
ing ${}^{n}P_{k}/k! = {}^{n}C_{k}$ choices.

In our example with the pens, the numbers in the three boxes are $4^2 = 16$,
${}^{4}P_{2} = 12$, and ${}^{4}C_{2} = 6$, in agreement with what we got when we wrote them all
out.
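
The formulae can be compared with direct enumeration. The sketch below does this for n = 4 and k = 2, using the standard-library functions `math.perm` and `math.comb` (available from Python 3.8) for ${}^{n}P_{k}$ and ${}^{n}C_{k}$; the helper `count` and the variable names are our own. It also counts the unordered samples with replacement, the case left unproved above.

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)
from math import comb, perm


def count(iterable):
    """Number of items produced by an iterable."""
    return sum(1 for _ in iterable)


n, k = 4, 2
objects = range(n)

ordered_with = count(product(objects, repeat=k))                    # n**k
ordered_without = count(permutations(objects, k))                   # nPk
unordered_without = count(combinations(objects, k))                 # nCk
unordered_with = count(combinations_with_replacement(objects, k))   # (n+k-1)Ck

print(ordered_with, n**k)
print(ordered_without, perm(n, k))
print(unordered_without, comb(n, k))
print(unordered_with, comb(n + k - 1, k))
```

Changing `n` and `k` lets you test other small cases; enumeration becomes slow quickly, which is exactly why the formulae are useful.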

Note that, if we use the phrase 'sampling without replacement, ordered sam- 
ple', or any other combination, we are assuming that all outcomes are equally 
likely. 

Example The names of the seven days of the week are placed in a hat. Three 
names are drawn out; these will be the days of the Probability I lectures. What is 
the probability that no lecture is scheduled at the weekend? 

Here the sampling is without replacement, and we can take it to be either
ordered or unordered; the answers will be the same. For ordered samples, the
size of the sample space is ${}^{7}P_{3} = 7 \cdot 6 \cdot 5 = 210$. If A is the event 'no lectures at
weekends', then A occurs precisely when all three days drawn are weekdays; so
$|A| = {}^{5}P_{3} = 5 \cdot 4 \cdot 3 = 60$. Thus, P(A) = 60/210 = 2/7.

If we decided to use unordered samples instead, the answer would be ${}^{5}C_{3} / {}^{7}C_{3}$,
which is once again 2/7.
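
To see that the ordered and unordered calculations agree, one can enumerate both sample spaces. This is a minimal sketch, assuming every selection of three days is equally likely; the day labels and the function name are our own.

```python
from fractions import Fraction
from itertools import combinations, permutations

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
weekend = {"Sat", "Sun"}


def prob_no_weekend(samples):
    """Fraction of samples containing no weekend day."""
    samples = list(samples)
    good = [s for s in samples if not weekend.intersection(s)]
    return Fraction(len(good), len(samples))


p_ordered = prob_no_weekend(permutations(days, 3))    # 210 ordered samples
p_unordered = prob_no_weekend(combinations(days, 3))  # 35 unordered samples
print(p_ordered, p_unordered)
```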

Example A six-sided die is rolled twice. What is the probability that the sum of 
the numbers is at least 10? 

This time we are sampling with replacement, since the two numbers may be 
the same or different. So the number of elements in the sample space is $6^2 = 36$.

To obtain a sum of 10 or more, the possibilities for the two numbers are (4, 6), 
(5,5), (6,4), (5,6), (6,5) or (6,6). So the probability of the event is 6/36 = 1/6. 
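
The same answer can be found by listing all 36 ordered pairs. The sketch assumes a fair die, so that every pair is equally likely.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first roll, second roll).
rolls = list(product(range(1, 7), repeat=2))

# The event 'sum of the numbers is at least 10'.
event = [r for r in rolls if sum(r) >= 10]

p = Fraction(len(event), len(rolls))
print(sorted(event), p)
```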

Example A box contains 20 balls, of which 10 are red and 10 are blue. We draw 
ten balls from the box, and we are interested in the event that exactly 5 of the balls 
are red and 5 are blue. Do you think that this is more likely to occur if the draws 
are made with or without replacement? 

Let S be the sample space, and A the event that five balls are red and five are
blue.

Consider sampling with replacement. Then $|S| = 20^{10}$. What is |A|? The
number of ways in which we can choose first five red balls and then five blue ones
(that is, RRRRRBBBBB), is $10^5 \cdot 10^5 = 10^{10}$. But there are many other ways to get
five red and five blue balls. In fact, the five red balls could appear in any five of
the ten draws. This means that there are ${}^{10}C_{5} = 252$ different patterns of five Rs
and five Bs. So we have

$$|A| = 252 \cdot 10^{10},$$

and so

$$P(A) = \frac{252 \cdot 10^{10}}{20^{10}} = 0.246\ldots$$

Now consider sampling without replacement. If we regard the sample as being
ordered, then $|S| = {}^{20}P_{10}$. There are ${}^{10}P_{5}$ ways of choosing five of the ten red
balls, and the same for the ten blue balls, and as in the previous case there are
${}^{10}C_{5}$ patterns of red and blue balls. So

$$|A| = ({}^{10}P_{5})^2 \cdot {}^{10}C_{5},$$

and

$$P(A) = \frac{({}^{10}P_{5})^2 \cdot {}^{10}C_{5}}{{}^{20}P_{10}} = 0.343\ldots$$

If we regard the sample as being unordered, then $|S| = {}^{20}C_{10}$. There are ${}^{10}C_{5}$
choices of the five red balls and the same for the blue balls. We no longer have to
count patterns since we don't care about the order of the selection. So

$$|A| = ({}^{10}C_{5})^2,$$

and

$$P(A) = \frac{({}^{10}C_{5})^2}{{}^{20}C_{10}} = 0.343\ldots$$

This is the same answer as in the case before, as it should be; the question doesn't
care about order of choices!

So the event is more likely if we sample without replacement.
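
The three probabilities in this example are easy to evaluate exactly. The sketch below uses `math.comb` and `math.perm` for ${}^{n}C_{k}$ and ${}^{n}P_{k}$; the variable names are our own.

```python
from fractions import Fraction
from math import comb, perm

# With replacement: |A| = 10C5 * 10**10 and |S| = 20**10.
p_with = Fraction(comb(10, 5) * 10**10, 20**10)

# Without replacement, ordered: |A| = (10P5)**2 * 10C5 and |S| = 20P10.
p_without_ordered = Fraction(perm(10, 5)**2 * comb(10, 5), perm(20, 10))

# Without replacement, unordered: |A| = (10C5)**2 and |S| = 20C10.
p_without_unordered = Fraction(comb(10, 5)**2, comb(20, 10))

print(float(p_with))
print(float(p_without_ordered))
print(float(p_without_unordered))
```

The two without-replacement values are equal as exact fractions, and both exceed the with-replacement value.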



Example I have 6 gold coins, 4 silver coins and 3 bronze coins in my pocket. I 
take out three coins at random. What is the probability that they are all of different 
material? What is the probability that they are all of the same material? 

In this case the sampling is without replacement and the sample is unordered.
So $|S| = {}^{13}C_{3} = 286$. The event that the three coins are all of different material
can occur in $6 \cdot 4 \cdot 3 = 72$ ways, since we must have one of the six gold coins, and
so on. So the probability is 72/286 = 0.252...






The event that the three coins are of the same material can occur in

$${}^{6}C_{3} + {}^{4}C_{3} + {}^{3}C_{3} = 20 + 4 + 1 = 25$$

ways, and the probability is 25/286 = 0.087...
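
Both counts can be confirmed by enumerating all unordered samples of three coins. In this sketch the thirteen coins are treated as distinct objects, identified by their position in a list (an assumption made so that equally likely samples can be listed); the names are our own.

```python
from fractions import Fraction
from itertools import combinations

coins = ["gold"] * 6 + ["silver"] * 4 + ["bronze"] * 3

# Each sample is a set of three distinct coin positions.
samples = list(combinations(range(len(coins)), 3))


def materials(sample):
    """The set of materials appearing among the chosen coins."""
    return {coins[i] for i in sample}


all_different = sum(1 for s in samples if len(materials(s)) == 3)
all_same = sum(1 for s in samples if len(materials(s)) == 1)

p_diff = Fraction(all_different, len(samples))
p_same = Fraction(all_same, len(samples))
print(len(samples), all_different, all_same)
print(float(p_diff), float(p_same))
```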

In a sampling problem, you should first read the question carefully and decide 
whether the sampling is with or without replacement. If it is without replacement, 
decide whether the sample is ordered (e.g. does the question say anything about 
the first object drawn?). If so, then use the formula for ordered samples. If not, 
then you can use either ordered or unordered samples, whichever is convenient; 
they should give the same answer. If the sample is with replacement, or if it 
involves throwing a die or coin several times, then use the formula for sampling 
with replacement. 

1.8 Stopping rules 

Suppose that you take a typing proficiency test. You are allowed to take the test 
up to three times. Of course, if you pass the test, you don't need to take it again. 
So the sample space is 

S = {p, fp, ffp, fff},

where for example ffp denotes the outcome that you fail twice and pass on your 
third attempt. 

If all outcomes were equally likely, then your chance of eventually passing the 
test and getting the certificate would be 3/4. 

But it is unreasonable here to assume that all the outcomes are equally likely.
For example, you may be very likely to pass on the first attempt. Let us assume
that the probability that you pass the test is 0.8. (By Proposition 1.3, your chance
of failing is 0.2.) Let us further assume that, no matter how many times you have
failed, your chance of passing at the next attempt is still 0.8. Then we have

$$\begin{aligned}
P(p) &= 0.8, \\
P(fp) &= 0.2 \cdot 0.8 = 0.16, \\
P(ffp) &= 0.2^2 \cdot 0.8 = 0.032, \\
P(fff) &= 0.2^3 = 0.008.
\end{aligned}$$

Thus the probability that you eventually get the certificate is P({p, fp, ffp}) =
0.8 + 0.16 + 0.032 = 0.992. Alternatively, you eventually get the certificate unless
you fail three times, so the probability is 1 - 0.008 = 0.992.
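
The table of probabilities for the typing test can be generated for any pass probability and any limit on attempts. This is a sketch under the same assumptions as the text (pass probability 0.8 at every attempt, at most three attempts, probabilities multiplied along the sequence); the variable names are our own.

```python
from fractions import Fraction

p_pass = Fraction(8, 10)   # chance of passing at any one attempt
p_fail = 1 - p_pass        # by Proposition 1.3
max_attempts = 3

# Map each outcome, such as "ffp" (fail, fail, pass), to its probability.
probs = {}
for fails in range(max_attempts):
    probs["f" * fails + "p"] = p_fail**fails * p_pass
probs["f" * max_attempts] = p_fail**max_attempts

# The certificate is obtained exactly when the outcome ends with a pass.
p_certificate = sum(p for outcome, p in probs.items() if outcome.endswith("p"))

for outcome, p in probs.items():
    print(outcome, float(p))
print(float(p_certificate), float(sum(probs.values())))
```

The probabilities of the four outcomes sum to 1, as Corollary 1.2 requires.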

A stopping rule is a rule of the type described here, namely, continue the exper- 
iment until some specified occurrence happens. The experiment may potentially 
be infinite.

For example, if you toss a coin repeatedly until you obtain heads, the sample
space is 

S = {H, TH, TTH, TTTH, . . .} 

since in principle you may get arbitrarily large numbers of tails before the first 
head. (We have to allow all possible outcomes.) 

In the typing test, the rule is 'stop if either you pass or you have taken the test 
three times'. This ensures that the sample space is finite. 

In the next chapter, we will have more to say about the 'multiplication rule' we 
used for calculating the probabilities. In the meantime you might like to consider
whether it is a reasonable assumption for tossing a coin, or for someone taking a 
series of tests. 

Other kinds of stopping rules are possible. For example, the number of coin 
tosses might be determined by some other random process such as the roll of a 
die; or we might toss a coin until we have obtained heads twice; and so on. We 
will not deal with these. 

1.9 Questionnaire results 

The students in the Probability I class in Autumn 2000 filled in the following
questionnaire:

1. I have a hat containing 20 balls, 10 red and 10 blue. I draw 10 balls
from the hat. I am interested in the event that I draw exactly five red and
five blue balls. Do you think that this is more likely if I note the colour of
each ball I draw and replace it in the hat, or if I don't replace the balls in
the hat after drawing?

More likely with replacement □ More likely without replacement □

2. What colour are your eyes?

Blue □ Brown □ Green □ Other □

3. Do you own a mobile phone? Yes □ No □

After discarding incomplete questionnaires, the results were as follows:



Answer to question     "More likely               "More likely
                       with replacement"          without replacement"
Eyes                   Brown        Other         Brown        Other
Mobile phone           35           4             35           9
No mobile phone        10           3             7            1

What can we conclude?

Half the class thought that, in the experiment with the coloured balls, sampling
with replacement makes the result more likely. In fact, as we saw in Section 1.7,
it is more likely if we sample without replacement. (This doesn't matter,
since the students were instructed not to think too hard about it!)

You might expect that eye colour and mobile phone ownership would have no 
influence on your answer. Let's test this. If true, then of the 87 people with brown 
eyes, half of them (i.e. 43 or 44) would answer "with replacement", whereas in 
fact 45 did. Also, of the 83 people with mobile phones, we would expect half (that 
is, 41 or 42) would answer "with replacement", whereas in fact 39 of them did. So 
perhaps we have demonstrated that people who own mobile phones are slightly 
smarter than average, whereas people with brown eyes are slightly less smart! 

In fact we have shown no such thing, since our results refer only to the peo- 
ple who filled out the questionnaire. But they do show that these events are not 
independent, in a sense we will come to soon. 

On the other hand, since 83 out of 104 people have mobile phones, if we 
think that phone ownership and eye colour are independent, we would expect 
that the same fraction 83/104 of the 87 brown-eyed people would have phones, 
i.e. $(83 \cdot 87)/104 = 69.4$ people. In fact the number is 70, or as near as we can
expect. So indeed it seems that eye colour and phone ownership are more-or-less 
independent. 

1.10 Independence 

Two events A and B are said to be independent if 

$$P(A \cap B) = P(A) \cdot P(B).$$

This is the definition of independence of events. If you are asked in an exam 
to define independence of events, this is the correct answer. Do not say that two 
events are independent if one has no influence on the other; and under no circum-
stances say that A and B are independent if $A \cap B = \emptyset$ (this is the statement that
A and B are disjoint, which is quite a different thing!) Also, do not ever say that
$P(A \cap B) = P(A) \cdot P(B)$ unless you have some good reason for assuming that A
and B are independent (either because this is given in the question, or as in the
next-but-one paragraph).

Let us return to the questionnaire example. Suppose that a student is chosen 
at random from those who filled out the questionnaire. Let A be the event that this 
student thought that the event was more likely if we sample with replacement; B 
the event that the student has brown eyes; and C the event that the student has a
mobile phone. Then

$$\begin{aligned}
P(A) &= 52/104 = 0.5, \\
P(B) &= 87/104 = 0.8365, \\
P(C) &= 83/104 = 0.7981.
\end{aligned}$$

Furthermore,

$$\begin{aligned}
P(A \cap B) &= 45/104 = 0.4327, & P(A) \cdot P(B) &= 0.4183, \\
P(A \cap C) &= 39/104 = 0.375, & P(A) \cdot P(C) &= 0.3990, \\
P(B \cap C) &= 70/104 = 0.6731, & P(B) \cdot P(C) &= 0.6676.
\end{aligned}$$




So none of the three pairs is independent, but in a sense B and C 'come closer' 
than either of the others, as we noted. 
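
These figures can be recomputed from the questionnaire counts. In the sketch below, the eight counts are those of the results table in Section 1.9; the dictionary layout, the key labels and the function names are our own choices.

```python
from fractions import Fraction

# Counts keyed by (answer, eyes, has_phone), read off the results table.
counts = {
    ("with", "brown", True): 35,
    ("with", "other", True): 4,
    ("without", "brown", True): 35,
    ("without", "other", True): 9,
    ("with", "brown", False): 10,
    ("with", "other", False): 3,
    ("without", "brown", False): 7,
    ("without", "other", False): 1,
}
total = sum(counts.values())


def prob(predicate):
    """Probability that a randomly chosen student satisfies the predicate."""
    favourable = sum(c for key, c in counts.items() if predicate(*key))
    return Fraction(favourable, total)


def in_A(answer, eyes, phone):   # answered "with replacement"
    return answer == "with"


def in_B(answer, eyes, phone):   # has brown eyes
    return eyes == "brown"


def in_C(answer, eyes, phone):   # owns a mobile phone
    return phone


pairs = {"A,B": (in_A, in_B), "A,C": (in_A, in_C), "B,C": (in_B, in_C)}
results = {}
for name, (e1, e2) in pairs.items():
    joint = prob(lambda *k: e1(*k) and e2(*k))   # P of the intersection
    product = prob(e1) * prob(e2)                # product of the two probabilities
    results[name] = (joint, product, joint == product)
    print(name, float(joint), float(product), joint == product)
```

Working with exact fractions makes the test for independence an exact equality rather than a comparison of rounded decimals.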

In practice, if it is the case that the event A has no effect on the outcome 
of event B, then A and B are independent. But this does not apply in the other 
direction. There might be a very definite connection between A and B, but still it 
could happen that $P(A \cap B) = P(A) \cdot P(B)$, so that A and B are independent. We
will see an example shortly. 

Example If we toss a coin more than once, or roll a die more than once, then 
you may assume that different tosses or rolls are independent. More precisely, 
if we roll a fair six-sided die twice, then the probability of getting 4 on the first 
throw and 5 on the second is 1/36, since we assume that all 36 combinations of 
the two throws are equally likely. But $1/36 = (1/6) \cdot (1/6)$, and the separate
probabilities of getting 4 on the first throw and of getting 5 on the second are both
equal to 1/6. So the two events are independent. This would work just as well for
any other combination. 

In general, it is always OK to assume that the outcomes of different tosses of a 
coin, or different throws of a die, are independent. This holds even if the outcomes
are not all equally likely. We will see an example later.

Example I have two red pens, one green pen, and one blue pen. I choose two
pens without replacement. Let A be the event that I choose exactly one red pen,
and B the event that I choose exactly one green pen.
If the pens are called $R_1, R_2, G, B$, then

$$S = \{R_1R_2, R_1G, R_1B, R_2G, R_2B, GB\},$$
$$A = \{R_1G, R_1B, R_2G, R_2B\},$$
$$B = \{R_1G, R_2G, GB\}.$$

We have $P(A) = 4/6 = 2/3$, $P(B) = 3/6 = 1/2$, $P(A \cap B) = 2/6 = 1/3 = P(A)P(B)$,
so A and B are independent.

But before you say 'that's obvious', suppose that I have also a purple pen, 
and I do the same experiment. This time, if you write down the sample space 
and the two events and do the calculations, you will find that $P(A) = 6/10 = 3/5$,
$P(B) = 4/10 = 2/5$, $P(A \cap B) = 2/10 = 1/5 \ne P(A)P(B)$, so adding one more
pen has made the events non-independent!
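
The two pen calculations can be checked by listing every unordered pair of pens. The following is only a sketch: the function name `pen_probabilities` and the string labels for the pens are illustrative choices, not part of the notes, and exact fractions are used so that the comparison with the product is not affected by rounding.

```python
from fractions import Fraction
from itertools import combinations

def pen_probabilities(pens):
    """Return (P(A), P(B), P(A and B)) when two pens are drawn without replacement.

    A = exactly one red pen is chosen, B = exactly one green pen is chosen.
    Each pen is a label whose first letter gives its colour (assumed convention).
    """
    draws = list(combinations(pens, 2))  # all unordered pairs, equally likely

    def prob(event):
        return Fraction(sum(1 for d in draws if event(d)), len(draws))

    one_red = lambda d: sum(p[0] == "R" for p in d) == 1
    one_green = lambda d: sum(p[0] == "G" for p in d) == 1
    return prob(one_red), prob(one_green), prob(lambda d: one_red(d) and one_green(d))

four = pen_probabilities(["R1", "R2", "G", "B"])        # the original four pens
five = pen_probabilities(["R1", "R2", "G", "B", "P"])   # with the purple pen added

print(four, four[2] == four[0] * four[1])  # product rule holds
print(five, five[2] == five[0] * five[1])  # product rule fails
```

With four pens the product rule holds exactly; with five it does not, in agreement with the calculation above.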

We see that it is very difficult to tell whether events are independent or not. In 
practice, assume that events are independent only if either you are told to assume 
it, or the events are the outcomes of different throws of a coin or die. (There is 
one other case where you can assume independence: this is the result of different 
draws, with replacement, from a set of objects.) 

Example Consider the experiment where we toss a fair coin three times and 
note the results. Each of the eight possible outcomes has probability 1/8. Let A 
be the event 'there are more heads than tails', and B the event 'the results of the 
first two tosses are the same'. Then 

• A = {HHH, HHT, HTH, THH}, $P(A) = 1/2$,

• B = {HHH, HHT, TTH, TTT}, $P(B) = 1/2$,

• $A \cap B$ = {HHH, HHT}, $P(A \cap B) = 1/4$;

so A and B are independent. However, both A and B clearly involve the results of 
the first two tosses and it is not possible to make a convincing argument that one 
of these events has no influence or effect on the other. For example, let C be the 
event 'heads on the last toss'. Then, as we saw in Part 1, 

• C = {HHH, HTH, THH, TTH}, $P(C) = 1/2$,

• $A \cap C$ = {HHH, HTH, THH}, $P(A \cap C) = 3/8$;

so A and C are not independent. 
Are B and C independent? 

1.11 Mutual independence 

This section is a bit technical. You will need to know the conclusions, though the 
arguments we use to reach them are not so important. 

We saw in the coin-tossing example above that it is possible to have three 
events A, B, C so that A and B are independent, B and C are independent, but A and
C are not independent.

If all three pairs of events happen to be independent, can we then conclude
that $P(A \cap B \cap C) = P(A) \cdot P(B) \cdot P(C)$? At first sight this seems very reasonable;
in Axiom 3, we only required all pairs of events to be exclusive in order to justify
our conclusion. Unfortunately it is not true...

Example In the coin-tossing example, let A be the event 'first and second tosses
have the same result', B the event 'first and third tosses have the same result', and
C the event 'second and third tosses have the same result'. You should check that
$P(A) = P(B) = P(C) = 1/2$, and that the events $A \cap B$, $B \cap C$, $A \cap C$, and $A \cap B \cap C$
are all equal to {HHH, TTT}, with probability 1/4. Thus any pair of the three
events are independent, but

$$P(A \cap B \cap C) = 1/4,$$

$$P(A) \cdot P(B) \cdot P(C) = 1/8.$$

So A, B, C are not mutually independent.
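
This example can be verified by brute force over the eight outcomes. The sketch below uses illustrative names (`outcomes`, `prob`, `pairwise`, `triple`) that do not appear in the notes; it checks the product rule for each pair and then for all three events together.

```python
from fractions import Fraction
from itertools import product

# The eight equally likely outcomes of three tosses of a fair coin.
outcomes = ["".join(t) for t in product("HT", repeat=3)]

def prob(event):
    """Probability of an event (a set of outcomes) under equal likelihood."""
    return Fraction(len(event), len(outcomes))

A = {w for w in outcomes if w[0] == w[1]}  # first and second tosses agree
B = {w for w in outcomes if w[0] == w[2]}  # first and third tosses agree
C = {w for w in outcomes if w[1] == w[2]}  # second and third tosses agree

# Product rule for each pair, then for the triple intersection.
pairwise = all(prob(X & Y) == prob(X) * prob(Y) for X, Y in [(A, B), (B, C), (A, C)])
triple = prob(A & B & C) == prob(A) * prob(B) * prob(C)

print(pairwise, triple)  # True False
```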

The correct definition and proposition run as follows. 

Let $A_1, \ldots, A_n$ be events. We say that these events are mutually independent if,
given any distinct indices $i_1, i_2, \ldots, i_k$ with $k > 1$, the events

$$A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_{k-1}} \quad \text{and} \quad A_{i_k}$$

are independent. In other words, any one of the events is independent of the
intersection of any number of the other events in the set.

Proposition 1.11 Let $A_1, \ldots, A_n$ be mutually independent. Then

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \cdot P(A_2) \cdots P(A_n).$$



Now all you really need to know is that the same 'physical' arguments that 
justify that two events (such as two tosses of a coin, or two throws of a die) are 
independent, also justify that any number of such events are mutually independent. 

So, for example, if we toss a fair coin six times, the probability of getting the 
sequence HHTHHT is $(1/2)^6 = 1/64$, and the same would apply for any other
sequence. In other words, all 64 possible outcomes are equally likely. 



1.12 Properties of independence 

Proposition 1.12 If A and B are independent, then A and B' are independent.

We are given that $P(A \cap B) = P(A) \cdot P(B)$, and asked to prove that
$P(A \cap B') = P(A) \cdot P(B')$.

From Corollary 4, we know that $P(B') = 1 - P(B)$. Also, the events $A \cap B$ and
$A \cap B'$ are disjoint (since no outcome can be both in B and B'), and their union
is A (since every outcome in A is either in B or in B'); so by Axiom 3, we have that
$P(A) = P(A \cap B) + P(A \cap B')$. Thus,

$$\begin{aligned}
P(A \cap B') &= P(A) - P(A \cap B) \\
&= P(A) - P(A) \cdot P(B) && \text{(since A and B are independent)} \\
&= P(A)(1 - P(B)) \\
&= P(A) \cdot P(B'),
\end{aligned}$$

which is what we were required to prove.

Corollary 1.13 If A and B are independent, so are A' and B'.

Apply the Proposition twice, first to A and B (to show that A and B' are
independent), and then to B' and A (to show that B' and A' are independent).

More generally, if events $A_1, \ldots, A_n$ are mutually independent, and we replace
some of them by their complements, then the resulting events are mutually
independent. We have to be a bit careful though. For example, A and A' are not usually
independent!

Results like the following are also true.

Proposition 1.14 Let events A, B, C be mutually independent. Then A and $B \cap C$
are independent, and A and $B \cup C$ are independent.

Example Consider the example of the typing proficiency test that we looked at
earlier. You are allowed up to three attempts to pass the test.

Suppose that your chance of passing the test is 0.8. Suppose also that the
events of passing the test on any number of different occasions are mutually
independent. Then, by Proposition 1.11, the probability of any sequence of passes and
fails is the product of the probabilities of the terms in the sequence. That is,

$$P(p) = 0.8, \quad P(fp) = (0.2) \cdot (0.8), \quad P(ffp) = (0.2)^2 \cdot (0.8), \quad P(fff) = (0.2)^3,$$

as we claimed in the earlier example.

In other words, mutual independence is the condition we need to justify the 
argument we used in that example.

Example The electrical apparatus in the diagram works so long as current can
flow from left to right. The three components are independent. The probability
that component A works is 0.8; the probability that component B works is 0.9;
and the probability that component C works is 0.75.
Find the probability that the apparatus works.

[Figure: components A and B in series, connected in parallel with component C.]

At risk of some confusion, we use the letters A, B and C for the events
'component A works', 'component B works', and 'component C works', respectively.
Now the apparatus will work if either A and B are working, or C is working (or
possibly both). Thus the event we are interested in is $(A \cap B) \cup C$.

Now

$$\begin{aligned}
P((A \cap B) \cup C) &= P(A \cap B) + P(C) - P(A \cap B \cap C) && \text{(by Inclusion-Exclusion)} \\
&= P(A) \cdot P(B) + P(C) - P(A) \cdot P(B) \cdot P(C) && \text{(by mutual independence)} \\
&= (0.8) \cdot (0.9) + (0.75) - (0.8) \cdot (0.9) \cdot (0.75) \\
&= 0.93.
\end{aligned}$$

The problem can also be analysed in a different way. The apparatus will not
work if both paths are blocked, that is, if C is not working and one of A and B is
also not working. Thus, the event that the apparatus does not work is $(A' \cup B') \cap C'$.
By the Distributive Law, this is equal to $(A' \cap C') \cup (B' \cap C')$. We have

$$\begin{aligned}
P((A' \cap C') \cup (B' \cap C')) &= P(A' \cap C') + P(B' \cap C') - P(A' \cap B' \cap C') && \text{(by Inclusion-Exclusion)} \\
&= P(A') \cdot P(C') + P(B') \cdot P(C') - P(A') \cdot P(B') \cdot P(C') && \text{(by mutual independence of } A', B', C') \\
&= (0.2) \cdot (0.25) + (0.1) \cdot (0.25) - (0.2) \cdot (0.1) \cdot (0.25) \\
&= 0.07,
\end{aligned}$$

so the apparatus works with probability $1 - 0.07 = 0.93$.
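
Both calculations can be cross-checked without any set algebra by summing the probabilities of all eight combinations of working and failed components. This is a sketch under the stated assumption of mutual independence; the names `p_works` and `apparatus_works_probability` are illustrative only.

```python
from itertools import product

# Probability that each component works, as given in the example.
p_works = {"A": 0.8, "B": 0.9, "C": 0.75}

def apparatus_works_probability(p):
    """Sum the probabilities of every component state in which current can flow.

    Assumes the three components are mutually independent, so the probability
    of a joint state is the product of the individual probabilities.
    """
    total = 0.0
    for a, b, c in product([True, False], repeat=3):
        weight = 1.0
        for name, state in zip("ABC", (a, b, c)):
            weight *= p[name] if state else 1 - p[name]
        if (a and b) or c:  # path through A and B, or path through C
            total += weight
    return total

print(round(apparatus_works_probability(p_works), 4))  # 0.93
```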

There is a trap here which you should take care to avoid. You might be tempted
to say $P(A' \cap C') = (0.2) \cdot (0.25) = 0.05$, and $P(B' \cap C') = (0.1) \cdot (0.25) = 0.025$;
and conclude that

$$P((A' \cap C') \cup (B' \cap C')) = 0.05 + 0.025 - (0.05) \cdot (0.025) = 0.07375$$

by the Principle of Inclusion and Exclusion. But this is not correct, since the
events $A' \cap C'$ and $B' \cap C'$ are not independent!

Example We can always assume that successive tosses of a coin are mutually
independent, even if it is not a fair coin. Suppose that I have a coin which has 
probability 0.6 of coming down heads. I toss the coin three times. What are the 
probabilities of getting three heads, two heads, one head, or no heads? 

For three heads, since successive tosses are mutually independent, the
probability is $(0.6)^3 = 0.216$.

The probability of tails on any toss is $1 - 0.6 = 0.4$. Now the event 'two
heads' can occur in three possible ways, as HHT, HTH, or THH. Each outcome
has probability $(0.6) \cdot (0.6) \cdot (0.4) = 0.144$. So the probability of two heads is
$3 \cdot (0.144) = 0.432$.

Similarly the probability of one head is $3 \cdot (0.6) \cdot (0.4)^2 = 0.288$, and the
probability of no heads is $(0.4)^3 = 0.064$.

As a check, we have 

0.216 + 0.432 + 0.288 + 0.064= 1. 
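
The same pattern works for any number of tosses: count the arrangements with a given number of heads and multiply by the probability of one such arrangement. The sketch below assumes mutually independent tosses; `heads_distribution` is an illustrative name, and `math.comb` is used to count the arrangements (three of them for two heads in three tosses).

```python
from math import comb

def heads_distribution(n, p):
    """Return [P(0 heads), P(1 head), ..., P(n heads)] for n independent tosses.

    p is the probability of heads on a single toss.
    """
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

dist = heads_distribution(3, 0.6)
for k, value in enumerate(dist):
    print(k, "heads:", round(value, 3))
print("total:", round(sum(dist), 10))
```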

1.13 Worked examples 

Question 

(a) You go to the shop to buy a toothbrush. The toothbrushes there are red, blue,
green, purple and white. The probability that you buy a red toothbrush is
three times the probability that you buy a green one; the probability that you
buy a blue one is twice the probability that you buy a green one; the
probabilities of buying green, purple, and white are all equal. You are certain to
buy exactly one toothbrush. For each colour, find the probability that you
buy a toothbrush of that colour.

(b) James and Simon share a flat, so it would be confusing if their toothbrushes
were the same colour. On the first day of term they both go to the shop to
buy a toothbrush. For each of James and Simon, the probability of buying
various colours of toothbrush is as calculated in (a), and their choices are
independent. Find the probability that they buy toothbrushes of the same
colour.

(c) James and Simon live together for three terms. On the first day of each term
they buy new toothbrushes, with probabilities as in (b), independently of
what they had bought before. This is the only time that they change their
toothbrushes. Find the probability that James and Simon have differently
coloured toothbrushes from each other for all three terms. Is it more likely
that they will have differently coloured toothbrushes from each other for
all three terms or that they will sometimes have toothbrushes of the same
colour?

Solution 

(a) Let R, B, G, P, W be the events that you buy a red, blue, green, purple and
white toothbrush respectively. Let $x = P(G)$. We are given that

$$P(R) = 3x, \quad P(B) = 2x, \quad P(P) = P(W) = x.$$

Since these outcomes comprise the whole sample space, Corollary 2 gives

$$3x + 2x + x + x + x = 1,$$

so $x = 1/8$. Thus, the probabilities are 3/8, 1/4, 1/8, 1/8, 1/8 respectively.

(b) Let RB denote the event 'James buys a red toothbrush and Simon buys a blue
toothbrush', etc. By independence (given), we have, for example,

$$P(RR) = (3/8) \cdot (3/8) = 9/64.$$

The event that the toothbrushes have the same colour consists of the five
outcomes RR, BB, GG, PP, WW, so its probability is

$$P(RR) + P(BB) + P(GG) + P(PP) + P(WW) = \frac{9}{64} + \frac{1}{16} + \frac{1}{64} + \frac{1}{64} + \frac{1}{64} = \frac{1}{4}.$$

(c) The event 'different coloured toothbrushes in the ith term' has probability 3/4
(from part (b)), and these events are independent. So the event 'different
coloured toothbrushes in all three terms' has probability

$$\frac{3}{4} \cdot \frac{3}{4} \cdot \frac{3}{4} = \frac{27}{64}.$$

The event 'same coloured toothbrushes in at least one term' is the
complement of the above, so has probability $1 - 27/64 = 37/64$. So it is
more likely that they will have the same colour in at least one term.
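
The three parts of this solution can be reproduced with exact fractions. The sketch below is one way to organise the arithmetic; the dictionary `weights` encodes the stated ratios (red : blue : green : purple : white = 3 : 2 : 1 : 1 : 1), and all variable names are illustrative.

```python
from fractions import Fraction

# Relative weights from part (a): red is 3x, blue is 2x, the rest are x.
weights = {"red": 3, "blue": 2, "green": 1, "purple": 1, "white": 1}
total = sum(weights.values())
p = {colour: Fraction(w, total) for colour, w in weights.items()}

# Part (b): independent choices, so P(both buy colour c) = p[c] * p[c].
p_same = sum(q * q for q in p.values())

# Part (c): three independent terms, each 'different' with probability 1 - p_same.
p_all_different = (1 - p_same) ** 3
p_some_same = 1 - p_all_different

print(p["red"], p_same, p_all_different, p_some_same)  # 3/8 1/4 27/64 37/64
```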

Question There are 24 elephants in a game reserve. The warden tags six of the 
elephants with small radio transmitters and returns them to the reserve. The next 
month, he randomly selects five elephants from the reserve. He counts how many
of these elephants are tagged. Assume that no elephants leave or enter the reserve,
or die or give birth, between the tagging and the selection; and that all outcomes
of the selection are equally likely. Find the probability that exactly two of the
selected elephants are tagged, giving the answer correct to 3 decimal places.

Solution The experiment consists of picking the five elephants, not the original
choice of six elephants for tagging. Let S be the sample space. Then $|S| = {}^{24}C_5$.

Let A be the event that two of the selected elephants are tagged. This involves
choosing two of the six tagged elephants and three of the eighteen untagged ones,
so $|A| = {}^{6}C_2 \cdot {}^{18}C_3$. Thus

$$P(A) = \frac{{}^{6}C_2 \cdot {}^{18}C_3}{{}^{24}C_5} = 0.288$$

to 3 d.p.

Note: Should the sample be ordered or unordered? Since the answer
doesn't depend on the order in which the elephants are caught, an unordered
sample is preferable. If you want to use an ordered sample, the calculation is

$$P(A) = \frac{{}^{5}C_2 \cdot {}^{6}P_2 \cdot {}^{18}P_3}{{}^{24}P_5} = 0.288,$$

since it is necessary to multiply by the ${}^{5}C_2$ possible patterns of tagged and
untagged elephants in a sample of five with two tagged.
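
As a numerical check that the unordered and ordered counts agree, the two ratios can be evaluated with `math.comb` and `math.perm` from the Python standard library (available from Python 3.8). The variable names are illustrative.

```python
from math import comb, perm

# Unordered sample: choose 2 of the 6 tagged and 3 of the 18 untagged elephants.
unordered = comb(6, 2) * comb(18, 3) / comb(24, 5)

# Ordered sample: choose which 2 of the 5 positions are tagged, then fill the
# tagged and untagged positions in order.
ordered = comb(5, 2) * perm(6, 2) * perm(18, 3) / perm(24, 5)

print(round(unordered, 3), round(ordered, 3))  # 0.288 0.288
```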



Question A couple are planning to have a family. They decide to stop having 
children either when they have two boys or when they have four children. Sup- 
pose that they are successful in their plan. 

(a) Write down the sample space. 

(b) Assume that, each time that they have a child, the probability that it is a
boy is 1/2, independent of all other times. Find $P(E)$ and $P(F)$ where
E = "there are at least two girls", F = "there are more girls than boys".

Solution (a) S = {BB, BGB, GBB, BGGB, GBGB, GGBB, BGGG, GBGG,
GGBG, GGGB, GGGG}.

(b) E = {BGGB, GBGB, GGBB, BGGG, GBGG, GGBG, GGGB, GGGG},

F = {BGGG, GBGG, GGBG, GGGB, GGGG}.

Now we have $P(BB) = 1/4$, $P(BGB) = 1/8$, $P(BGGB) = 1/16$, and similarly
for the other outcomes. So $P(E) = 8/16 = 1/2$, $P(F) = 5/16$.
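
The sample space here has outcomes of different lengths, so they are not equally likely. A short recursive enumeration makes this explicit; the function name `family_outcomes` and the other identifiers are illustrative, and each child is assumed to be a boy with probability 1/2 independently of the others, as in part (b).

```python
from fractions import Fraction

def family_outcomes(prefix=""):
    """List all birth sequences, stopping at two boys or at four children."""
    if prefix.count("B") == 2 or len(prefix) == 4:
        return [prefix]
    return family_outcomes(prefix + "B") + family_outcomes(prefix + "G")

S = family_outcomes()

# A sequence of k children has probability (1/2)^k.
prob = {w: Fraction(1, 2 ** len(w)) for w in S}

P_E = sum(q for w, q in prob.items() if w.count("G") >= 2)            # at least two girls
P_F = sum(q for w, q in prob.items() if w.count("G") > w.count("B"))  # more girls than boys

print(len(S), sum(prob.values()), P_E, P_F)  # 11 1 1/2 5/16
```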



Chapter 2 

Conditional probability 



In this chapter we develop the technique of conditional probability to deal with 
cases where events are not independent. 

2.1 What is conditional probability? 

Alice and Bob are going out to dinner. They toss a fair coin 'best of three' to 
decide who pays: if there are more heads than tails in the three tosses then Alice 
pays, otherwise Bob pays. 

Clearly each has a 50% chance of paying. The sample space is 

S = {HHH,HHT,HTH,HTT, THH, THT, TTH, TTT}, 

and the events 'Alice pays' and 'Bob pays' are respectively 

A = {HHH,HHT,HTH, THH}, 
B = {HTT, THT, TTH, TTT}. 

They toss the coin once and the result is heads; call this event E. How should 
we now reassess their chances? We have 

E = {HHH,HHT,HTH,HTT}, 

and if we are given the information that the result of the first toss is heads, then E 
now becomes the sample space of the experiment, since the outcomes not in E are 
no longer possible. In the new experiment, the outcomes 'Alice pays' and 'Bob 
pays' are 

$A \cap E$ = {HHH, HHT, HTH},
$B \cap E$ = {HTT}.

Thus the new probabilities that Alice and Bob pay for dinner are 3/4 and 1/4
respectively.

In general, suppose that we are given that an event E has occurred, and we 
want to compute the probability that another event A occurs. In general, we can no 
longer count, since the outcomes may not be equally likely. The correct definition 
is as follows. 

Let E be an event with non-zero probability, and let A be any event. The 
conditional probability of A given E is defined as 

$$P(A \mid E) = \frac{P(A \cap E)}{P(E)}.$$

Again I emphasise that this is the definition. If you are asked for the definition 
of conditional probability, it is not enough to say "the probability of A given that 
E has occurred", although this is the best way to understand it. There is no reason 
why event E should occur before event A! 

Note the vertical bar in the notation. This is $P(A \mid E)$, not $P(A/E)$ or $P(A \setminus E)$.

Note also that the definition only applies in the case where $P(E)$ is not equal
to zero, since we have to divide by it, and this would make no sense if $P(E) = 0$.

To check the formula in our example: 

$$P(A \mid E) = \frac{P(A \cap E)}{P(E)} = \frac{3/8}{1/2} = \frac{3}{4},$$

$$P(B \mid E) = \frac{P(B \cap E)}{P(E)} = \frac{1/8}{1/2} = \frac{1}{4}.$$
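
The agreement between the definition and the 'restricted sample space' count can be seen directly by computing both. This is a sketch for the equally likely case only; the helper names `prob` and `conditional` are illustrative.

```python
from fractions import Fraction
from itertools import product

S = ["".join(t) for t in product("HT", repeat=3)]  # eight equally likely outcomes

def prob(event):
    return Fraction(len(event), len(S))

def conditional(event, given):
    """P(event | given) from the definition; requires P(given) != 0."""
    return prob(event & given) / prob(given)

A = {w for w in S if w.count("H") >= 2}  # Alice pays: more heads than tails
B = set(S) - A                           # Bob pays
E = {w for w in S if w[0] == "H"}        # first toss is heads

# Counting inside E gives the same answer as the definition.
by_counting = Fraction(len(A & E), len(E))

print(conditional(A, E), conditional(B, E), by_counting)  # 3/4 1/4 3/4
```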



It may seem like a small matter, but you should be familiar enough with this 
formula that you can write it down without stopping to think about the names of 
the events. Thus, for example, 

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

if $P(B) \ne 0$.



Example A random car is chosen among all those passing through Trafalgar
Square on a certain day. The probability that the car is yellow is 3/100;
the probability that the driver is blonde is 1/5; and the probability that the car is
yellow and the driver is blonde is 1/50.

Find the conditional probability that the driver is blonde given that the car is
yellow.

Solution: If Y is the event 'the car is yellow' and B the event 'the driver is blonde',
then we are given that $P(Y) = 0.03$, $P(B) = 0.2$, and $P(Y \cap B) = 0.02$. So

$$P(B \mid Y) = \frac{P(Y \cap B)}{P(Y)} = \frac{0.02}{0.03} = 0.667$$

to 3 d.p. Note that we haven't used all the information given.

There is a connection between conditional probability and independence: 

Proposition 2.1 Let A and B be events with $P(B) \ne 0$. Then A and B are
independent if and only if $P(A \mid B) = P(A)$.

Proof The words 'if and only if' tell us that we have two jobs to do: we have to
show that if A and B are independent, then $P(A \mid B) = P(A)$; and that if
$P(A \mid B) = P(A)$, then A and B are independent.

So first suppose that A and B are independent. Remember that this means that
$P(A \cap B) = P(A) \cdot P(B)$. Then

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A) \cdot P(B)}{P(B)} = P(A),$$

that is, $P(A \mid B) = P(A)$, as we had to prove.

Now suppose that $P(A \mid B) = P(A)$. In other words,

$$\frac{P(A \cap B)}{P(B)} = P(A),$$

using the definition of conditional probability. Now clearing fractions gives

$$P(A \cap B) = P(A) \cdot P(B),$$

which is just what the statement 'A and B are independent' means.

This proposition is most likely what people have in mind when they say 'A
and B are independent means that B has no effect on A'.



2.2 Genetics 

Here is a simplified version of how genes code eye colour, assuming only two 
colours of eyes. 

Each person has two genes for eye colour. Each gene is either B or b. A child 
receives one gene from each of its parents. The gene it receives from its father 
is one of its father's two genes, each with probability 1/2; and similarly for its 
mother. The genes received from father and mother are independent. 

If your genes are BB or Bb or bB, you have brown eyes; if your genes are bb, 
you have blue eyes.

Example Suppose that John has brown eyes. So do both of John's parents. His
sister has blue eyes. What is the probability that John's genes are BB?



Solution John's sister has genes bb, so one b must have come from each parent. 
Thus each of John's parents is Bb or bB; we may assume Bb. So the possibilities 
for John are (writing the gene from his father first) 

BB,Bb,bB,bb 

each with probability 1/4. (For example, John gets his father's B gene with prob- 
ability 1/2 and his mother's B gene with probability 1/2, and these are indepen- 
dent, so the probability that he gets BB is 1/4. Similarly for the other combina- 
tions.) 

Let X be the event 'John has BB genes' and Y the event 'John has brown 
eyes'. Then X = {BB} and Y = {BB, Bb, bB}. The question asks us to calculate
$P(X \mid Y)$. This is given by

$$P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)} = \frac{1/4}{3/4} = 1/3.$$



2.3 The Theorem of Total Probability 

Sometimes we are faced with a situation where we do not know the probability of 
an event B, but we know what its probability would be if we were sure that some 
other event had occurred. 



Example An ice-cream seller has to decide whether to order more stock for the 
Bank Holiday weekend. He estimates that, if the weather is sunny, he has a 90% 
chance of selling all his stock; if it is cloudy, his chance is 60%; and if it rains, his 
chance is only 20%. According to the weather forecast, the probability of sunshine
is 30%, the probability of cloud is 45%, and the probability of rain is 25%. (We 
assume that these are all the possible outcomes, so that their probabilities must 
add up to 100%.) What is the overall probability that the salesman will sell all his 
stock? 

This problem is answered by the Theorem of Total Probability, which we now 
state. First we need a definition. The events $A_1, A_2, \ldots, A_n$ form a partition of the
sample space if the following two conditions hold:

(a) the events are pairwise disjoint, that is, $A_i \cap A_j = \emptyset$ for any pair of events $A_i$
and $A_j$;

(b) $A_1 \cup A_2 \cup \cdots \cup A_n = S$.

Another way of saying the same thing is that every outcome in the sample space
lies in exactly one of the events $A_1, A_2, \ldots, A_n$. The picture shows the idea of a
partition.

[Figure: the sample space divided into the regions $A_1, A_2, \ldots, A_n$.]



Now we state and prove the Theorem of Total Probability. 

Theorem 2.2 Let $A_1, A_2, \ldots, A_n$ form a partition of the sample space with $P(A_i) \ne 0$
for all i, and let B be any event. Then

$$P(B) = \sum_{i=1}^{n} P(B \mid A_i) \cdot P(A_i).$$

Proof By definition, $P(B \mid A_i) = P(B \cap A_i) / P(A_i)$. Multiplying up, we find that

$$P(B \cap A_i) = P(B \mid A_i) \cdot P(A_i).$$

Now consider the events $B \cap A_1, B \cap A_2, \ldots, B \cap A_n$. These events are pairwise
disjoint; for any outcome lying in both $B \cap A_i$ and $B \cap A_j$ would lie in both $A_i$ and
$A_j$, and by assumption there are no such outcomes. Moreover, the union of all
these events is B, since every outcome lies in one of the $A_i$. So, by Axiom 3, we
conclude that

$$\sum_{i=1}^{n} P(B \cap A_i) = P(B).$$

Substituting our expression for $P(B \cap A_i)$ gives the result.

[Figure: the event B overlapping the parts $A_1, \ldots, A_n$ of the partition.]









Consider the ice-cream salesman at the start of this section. Let $A_1$ be the
event 'it is sunny', $A_2$ the event 'it is cloudy', and $A_3$ the event 'it is rainy'. Then
$A_1$, $A_2$ and $A_3$ form a partition of the sample space, and we are given that

$$P(A_1) = 0.3, \quad P(A_2) = 0.45, \quad P(A_3) = 0.25.$$

Let B be the event 'the salesman sells all his stock'. The other information we are
given is that

$$P(B \mid A_1) = 0.9, \quad P(B \mid A_2) = 0.6, \quad P(B \mid A_3) = 0.2.$$

By the Theorem of Total Probability,

$$P(B) = (0.9 \times 0.3) + (0.6 \times 0.45) + (0.2 \times 0.25) = 0.59.$$
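
The theorem is a weighted sum, which is easy to express as a small function. In the sketch below, `total_probability` and the dictionary keys are illustrative names; the function also checks that the prior probabilities add up to 1, as they must for a partition.

```python
def total_probability(priors, likelihoods):
    """Return P(B) = sum over i of P(B | A_i) * P(A_i).

    priors[k] is P(A_k) for each part of the partition and
    likelihoods[k] is P(B | A_k).
    """
    assert abs(sum(priors.values()) - 1.0) < 1e-9, "priors must sum to 1"
    return sum(likelihoods[k] * priors[k] for k in priors)

weather = {"sunny": 0.3, "cloudy": 0.45, "rainy": 0.25}   # P(A_i)
sells_all = {"sunny": 0.9, "cloudy": 0.6, "rainy": 0.2}   # P(B | A_i)

p_b = total_probability(weather, sells_all)
print(round(p_b, 2))  # 0.59
```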

You will now realise that the Theorem of Total Probability is really being used 
when you calculate probabilities by tree diagrams. It is better to get into the habit 
of using it directly, since it avoids any accidental assumptions of independence. 

One special case of the Theorem of Total Probability is very commonly used, 
and is worth stating in its own right. For any event A, the events A and A' form a 
partition of S. To say that both A and A' have non-zero probability is just to say
that $P(A) \ne 0, 1$. Thus we have the following corollary:

Corollary 2.3 Let A and B be events, and suppose that $P(A) \ne 0, 1$. Then

$$P(B) = P(B \mid A) \cdot P(A) + P(B \mid A') \cdot P(A').$$



2.4 Sampling revisited 

We can use the notion of conditional probability to treat sampling problems in- 
volving ordered samples. 

Example I have two red pens, one green pen, and one blue pen. I select two 
pens without replacement. 

(a) What is the probability that the first pen chosen is red? 

(b) What is the probability that the second pen chosen is red? 

For the first pen, there are four pens of which two are red, so the chance of 
selecting a red pen is 2/4 = 1/2. 

For the second pen, we must separate cases. Let $A_1$ be the event 'first pen red',
$A_2$ the event 'first pen green' and $A_3$ the event 'first pen blue'. Then $P(A_1) = 1/2$,
$P(A_2) = P(A_3) = 1/4$ (arguing as above). Let B be the event 'second pen red'.

If the first pen is red, then only one of the three remaining pens is red, so that
$P(B \mid A_1) = 1/3$. On the other hand, if the first pen is green or blue, then two of
the remaining pens are red, so $P(B \mid A_2) = P(B \mid A_3) = 2/3$.



By the Theorem of Total Probability,

$$\begin{aligned}
P(B) &= P(B \mid A_1)P(A_1) + P(B \mid A_2)P(A_2) + P(B \mid A_3)P(A_3) \\
&= (1/3) \times (1/2) + (2/3) \times (1/4) + (2/3) \times (1/4) \\
&= 1/2.
\end{aligned}$$



We have reached by a roundabout argument a conclusion which you might 
think to be obvious. If we have no information about the first pen, then the second 
pen is equally likely to be any one of the four, and the probability should be 1/2, 
just as for the first pen. This argument happens to be correct. But, until your 
ability to distinguish between correct arguments and plausible-looking false ones 
is very well developed, you may be safer to stick to the calculation that we did. 
Beware of obvious-looking arguments in probability! Many clever people have 
been caught out. 

2.5 Bayes' Theorem 

There is a very big difference between $P(A \mid B)$ and $P(B \mid A)$.

Suppose that a new test is developed to identify people who are liable to suffer 
from some genetic disease in later life. Of course, no test is perfect; there will be 
some carriers of the defective gene who test negative, and some non-carriers who 
test positive. So, for example, let A be the event 'the patient is a carrier', and B 
the event 'the test result is positive'. 

The scientists who develop the test are concerned with the probabilities that 
the test result is wrong, that is, with $P(B \mid A')$ and $P(B' \mid A)$. However, a patient
who has taken the test has different concerns. If I tested positive, what is the
chance that I have the disease? If I tested negative, how sure can I be that I am not
a carrier? In other words, $P(A \mid B)$ and $P(A' \mid B')$.

These conditional probabilities are related by Bayes' Theorem: 

Theorem 2.4 Let A and B be events with non-zero probability. Then

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}.$$

The proof is not hard. We have

$$P(A \mid B) \cdot P(B) = P(A \cap B) = P(B \mid A) \cdot P(A),$$



using the definition of conditional probability twice. (Note that we need both A 
and B to have non-zero probability here.) Now divide this equation by P{B) to get 
the result.

If $P(A) \ne 0, 1$ and $P(B) \ne 0$, then we can use Corollary 2.3 to write this as

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B \mid A) \cdot P(A) + P(B \mid A') \cdot P(A')}.$$

Bayes' Theorem is often stated in this form.

Example Consider the ice-cream salesman from Section 2.3. Given that he sold 
all his stock of ice-cream, what is the probability that the weather was sunny? 
(This question might be asked by the warehouse manager who doesn't know what 
the weather was actually like.) Using the same notation that we used before, Ai 
is the event 'it is sunny' and B the event 'the salesman sells all his stock'. We are 
asked for $P(A_1 \mid B)$. We were given that $P(B \mid A_1) = 0.9$ and that $P(A_1) = 0.3$, and
we calculated that $P(B) = 0.59$. So by Bayes' Theorem,

$$P(A_1 \mid B) = \frac{P(B \mid A_1) P(A_1)}{P(B)} = \frac{0.9 \times 0.3}{0.59} = 0.46$$

to 2 d.p.

Example Consider the clinical test described at the start of this section. Suppose 
that 1 in 1000 of the population is a carrier of the disease. Suppose also that the 
probability that a carrier tests negative is 1%, while the probability that a non- 
carrier tests positive is 5%. (A test achieving these values would be regarded as 
very successful.) Let A be the event 'the patient is a carrier', and B the event 'the
test result is positive'. We are given that $P(A) = 0.001$ (so that $P(A') = 0.999$),
and that

$$P(B \mid A) = 0.99, \quad P(B \mid A') = 0.05.$$

(a) A patient has just had a positive test result. What is the probability that the
patient is a carrier? The answer is

$$\begin{aligned}
P(A \mid B) &= \frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid A')P(A')} \\
&= \frac{0.99 \times 0.001}{(0.99 \times 0.001) + (0.05 \times 0.999)} \\
&= \frac{0.00099}{0.05094} = 0.0194.
\end{aligned}$$

(b) A patient has just had a negative test result. What is the probability that the
patient is a carrier? The answer is

$$\begin{aligned}
P(A \mid B') &= \frac{P(B' \mid A)P(A)}{P(B' \mid A)P(A) + P(B' \mid A')P(A')} \\
&= \frac{0.01 \times 0.001}{(0.01 \times 0.001) + (0.95 \times 0.999)} \\
&= \frac{0.00001}{0.94906} = 0.00001.
\end{aligned}$$



So a patient with a negative test result can be reassured; but a patient with a posi- 
tive test result still has less than 2% chance of being a carrier, so is likely to worry 
unnecessarily. 
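
The two-event form of Bayes' Theorem used in (a) and (b) can be wrapped in a small function to see how the answer responds to the prevalence. This is a sketch; `posterior` and its parameter names are illustrative, and the inputs are the figures from the example.

```python
def posterior(prior, p_evidence_given_a, p_evidence_given_not_a):
    """P(A | evidence) by Bayes' Theorem with the partition {A, A'}."""
    numerator = p_evidence_given_a * prior
    denominator = numerator + p_evidence_given_not_a * (1 - prior)
    return numerator / denominator

# (a) positive result: P(B | A) = 0.99, P(B | A') = 0.05
p_carrier_given_pos = posterior(0.001, 0.99, 0.05)

# (b) negative result: P(B' | A) = 0.01, P(B' | A') = 0.95
p_carrier_given_neg = posterior(0.001, 0.01, 0.95)

print(round(p_carrier_given_pos, 4), round(p_carrier_given_neg, 5))  # 0.0194 1e-05
```

Replacing the prior 0.001 by a larger value, as would be appropriate for a patient with a family history, raises the first figure sharply.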

Of course, these calculations assume that the patient has been selected at ran- 
dom from the population. If the patient has a family history of the disease, the 
calculations would be quite different. 



Example 2% of the population have a certain blood disease in a serious form; 
10% have it in a mild form; and 88% don't have it at all. A new blood test is 
developed; the probability of testing positive is 9/10 if the subject has the serious 
form, 6/10 if the subject has the mild form, and 1/10 if the subject doesn't have 
the disease. 

I have just tested positive. What is the probability that I have the serious form 
of the disease? 

Let $A_1$ be 'has disease in serious form', $A_2$ be 'has disease in mild form', and
$A_3$ be 'doesn't have disease'. Let B be 'test positive'. Then we are given that $A_1$,
$A_2$, $A_3$ form a partition and

$$P(A_1) = 0.02, \quad P(A_2) = 0.1, \quad P(A_3) = 0.88,$$
$$P(B \mid A_1) = 0.9, \quad P(B \mid A_2) = 0.6, \quad P(B \mid A_3) = 0.1.$$

Thus, by the Theorem of Total Probability,

$$P(B) = 0.9 \times 0.02 + 0.6 \times 0.1 + 0.1 \times 0.88 = 0.166,$$

and then by Bayes' Theorem,

$$P(A_1 \mid B) = \frac{P(B \mid A_1)P(A_1)}{P(B)} = \frac{0.9 \times 0.02}{0.166} = 0.108$$

to 3 d.p.
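
When the partition has more than two parts, the same two steps (Total Probability, then Bayes) give the conditional probability of every part at once. The following sketch uses the illustrative name `bayes_partition`; the numbers are those of the example.

```python
def bayes_partition(priors, likelihoods):
    """Return P(A_k | B) for every part A_k of a partition.

    priors[k] is P(A_k); likelihoods[k] is P(B | A_k).
    """
    joint = {k: likelihoods[k] * priors[k] for k in priors}  # P(B and A_k)
    p_b = sum(joint.values())                                # Theorem of Total Probability
    return {k: v / p_b for k, v in joint.items()}            # Bayes' Theorem

priors = {"serious": 0.02, "mild": 0.1, "none": 0.88}
positive = {"serious": 0.9, "mild": 0.6, "none": 0.1}

post = bayes_partition(priors, positive)
print({k: round(v, 3) for k, v in post.items()})
```

The conditional probabilities necessarily add up to 1, which gives a quick check on the arithmetic.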



2.6 Iterated conditional probability 

The conditional probability of C, given that both A and B have occurred, is just
$P(C \mid A \cap B)$. Sometimes instead we just write $P(C \mid A, B)$. It is given by

$$P(C \mid A, B) = \frac{P(C \cap A \cap B)}{P(A \cap B)},$$

so

$$P(A \cap B \cap C) = P(C \mid A, B) P(A \cap B).$$

Now we also have

$$P(A \cap B) = P(B \mid A) P(A),$$

so finally (assuming that $P(A \cap B) \ne 0$), we have

$$P(A \cap B \cap C) = P(C \mid A, B) P(B \mid A) P(A).$$

This generalises to any number of events:

Proposition 2.5 Let $A_1, \ldots, A_n$ be events. Suppose that $P(A_1 \cap \cdots \cap A_{n-1}) \ne 0$. Then

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_n \mid A_1, \ldots, A_{n-1}) \cdots P(A_2 \mid A_1) P(A_1).$$

We apply this to the birthday paradox. 

The birthday paradox is the following statement: 

If there are 23 or more people in a room, then the chances are better 
than even that two of them have the same birthday. 

To simplify the analysis, we ignore 29 February, and assume that the other 365 
days are all equally likely as birthdays of a random person. (This is not quite true 
but not inaccurate enough to have much effect on the conclusion.) Suppose that 
we have n people $p_1, p_2, \ldots, p_n$. Let $A_2$ be the event '$p_2$ has a different birthday
from $p_1$'. Then $P(A_2) = 1 - \frac{1}{365}$, since whatever $p_1$'s birthday is, there is a 1 in
365 chance that $p_2$ will have the same birthday.

Let $A_3$ be the event '$p_3$ has a different birthday from $p_1$ and $p_2$'. It is not
straightforward to evaluate $P(A_3)$, since we have to consider whether $p_1$ and $p_2$
have the same birthday or not. (See below.) But we can calculate that
$P(A_3 \mid A_2) = 1 - \frac{2}{365}$, since if $A_2$ occurs then $p_1$ and $p_2$ have birthdays on different days,
and $A_3$ will occur only if $p_3$'s birthday is on neither of these days. So

$$P(A_2 \cap A_3) = P(A_2) P(A_3 \mid A_2) = \left(1 - \frac{1}{365}\right)\left(1 - \frac{2}{365}\right).$$

What is $A_2 \cap A_3$? It is simply the event that all three people have birthdays on
different days.

Now this process extends. If $A_i$ denotes the event '$p_i$'s birthday is not on the
same day as any of $p_1, \ldots, p_{i-1}$', then

$$P(A_i \mid A_1, \ldots, A_{i-1}) = 1 - \frac{i-1}{365},$$

and so by Proposition 2.5,

$$P(A_1 \cap \cdots \cap A_i) = \left(1 - \tfrac{1}{365}\right)\left(1 - \tfrac{2}{365}\right) \cdots \left(1 - \tfrac{i-1}{365}\right).$$

Call this number $q_i$; it is the probability that all of the people $p_1, \ldots, p_i$ have
their birthdays on different days.

The numbers $q_i$ decrease, since at each step we multiply by a factor less than 1.
So there will be some value of n such that

$$q_{n-1} > 0.5, \qquad q_n < 0.5,$$

that is, n is the smallest number of people for which the probability that they all
have different birthdays is less than 1/2, that is, the probability of at least one
coincidence is greater than 1/2.

By calculation, we find that $q_{22} = 0.5243$, $q_{23} = 0.4927$ (to 4 d.p.); so 23
people are enough for the probability of coincidence to be greater than 1/2.

Now return to a question we left open before. What is the probability of the
event $A_3$? (This is the event that $p_3$ has a different birthday from both $p_1$ and $p_2$.)

If $p_1$ and $p_2$ have different birthdays, the probability is $1 - \frac{2}{365}$: this is the
calculation we already did. On the other hand, if $p_1$ and $p_2$ have the same birthday,
then the probability is $1 - \frac{1}{365}$. These two numbers are $P(A_3 \mid A_2)$ and $P(A_3 \mid A_2')$
respectively. So, by the Theorem of Total Probability,

$$P(A_3) = P(A_3 \mid A_2)\,P(A_2) + P(A_3 \mid A_2')\,P(A_2')
       = \left(1 - \tfrac{2}{365}\right)\left(1 - \tfrac{1}{365}\right) + \left(1 - \tfrac{1}{365}\right)\tfrac{1}{365}
       = 0.9945$$

to 4 d.p.

Problem How many people would you need to pick at random to ensure that
the chance of two of them being born in the same month is better than even?

Assuming all months equally likely, if $B_i$ is the event that $p_i$ is born in a
different month from any of $p_1, \ldots, p_{i-1}$, then as before we find that

$$P(B_i \mid B_1, \ldots, B_{i-1}) = 1 - \frac{i-1}{12},$$

so

$$P(B_1 \cap \cdots \cap B_i) = \left(1 - \tfrac{1}{12}\right)\left(1 - \tfrac{2}{12}\right) \cdots \left(1 - \tfrac{i-1}{12}\right).$$

We calculate that this probability is

$$(11/12) \times (10/12) \times (9/12) = 0.5729$$

for i = 4 and

$$(11/12) \times (10/12) \times (9/12) \times (8/12) = 0.3819$$

for i = 5. So, with five people, it is more likely that two will have the same birth
month.
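
The products above are easy to evaluate by machine. The following is a minimal sketch, not part of the original notes; the function names `prob_all_distinct` and `smallest_group` are illustrative choices, and the model is the same simplified one used in the text (equally likely days or months, independent people).

```python
from fractions import Fraction

def prob_all_distinct(people, days=365):
    """q_i: probability that `people` independent, uniform birthdays all differ."""
    q = Fraction(1)
    for k in range(1, people):
        q *= Fraction(days - k, days)  # the factor 1 - k/days from Proposition 2.5
    return q

def smallest_group(days=365):
    """Smallest n with q_n < 1/2, so that a coincidence is more likely than not."""
    n = 1
    while prob_all_distinct(n, days) >= Fraction(1, 2):
        n += 1
    return n

for n in (22, 23):
    print(n, round(float(prob_all_distinct(n)), 4))
print("days:", smallest_group(365), "months:", smallest_group(12))
```

Exact fractions are used so that the comparison with 1/2 is not affected by rounding; the values are converted to decimals only for printing.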



A true story. Some years ago, in a probability class with only ten students, the 
lecturer started discussing the Birthday Paradox. He said to the class, "I bet that 
no two people in the room have the same birthday". He should have been on safe
ground, since $q_{11} = 0.859$. (Remember that there are eleven people in the room!)
However, a student in the back said "I'll take the bet", and after a moment all the 
other students realised that the lecturer would certainly lose his wager. Why? 
(Answer in the next chapter.) 



2.7 Worked examples 

Question Each person has two genes for cystic fibrosis. Each gene is either N
or C. Each child receives one gene from each parent. If your genes are NN or NC
or CN then you are normal; if they are CC then you have cystic fibrosis.

(a) Neither of Sally's parents has cystic fibrosis. Nor does she. However, Sally's
sister Hannah does have cystic fibrosis. Find the probability that Sally has
at least one C gene (given that she does not have cystic fibrosis).

(b) In the general population the ratio of N genes to C genes is about 49 to 1.
You can assume that the two genes in a person are independent. Harry does
not have cystic fibrosis. Find the probability that he has at least one C gene
(given that he does not have cystic fibrosis).

(c) Harry and Sally plan to have a child. Find the probability that the child will
have cystic fibrosis (given that neither Harry nor Sally has it).



Solution During this solution, we will use a number of times the following
principle. Let A and B be events with $A \subseteq B$. Then $A \cap B = A$, and so

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)}{P(B)}.$$



(a) This is the same as the eye colour example discussed earlier. We are given
that Sally's sister has genes CC, and one gene must come from each parent. But
neither parent is CC, so each parent is CN or NC. Now by the basic rules of
genetics, all the four combinations of genes for a child of these parents, namely
CC, CN, NC, NN, will have probability 1/4.

If $S_1$ is the event 'Sally has at least one C gene', then $S_1 = \{CN, NC, CC\}$; and
if $S_2$ is the event 'Sally does not have cystic fibrosis', then $S_2 = \{CN, NC, NN\}$.
Then

$$P(S_1 \mid S_2) = \frac{P(S_1 \cap S_2)}{P(S_2)} = \frac{2/4}{3/4} = \frac{2}{3}.$$



(b) We know nothing specific about Harry, so we assume that his genes are
randomly and independently selected from the population. We are given that the
probability of a random gene being C or N is 1/50 and 49/50 respectively. Then
the probabilities of Harry having genes CC, CN, NC, NN are $(1/50)^2$,
$(1/50) \cdot (49/50)$, $(49/50) \cdot (1/50)$, and $(49/50)^2$, respectively. So, if $H_1$ is the
event 'Harry has at least one C gene', and $H_2$ is the event 'Harry does not have
cystic fibrosis', then

$$P(H_1 \mid H_2) = \frac{P(H_1 \cap H_2)}{P(H_2)} = \frac{(49/2500) + (49/2500)}{(49/2500) + (49/2500) + (2401/2500)} = \frac{2}{51}.$$

(c) Let X be the event that Harry's and Sally's child has cystic fibrosis. As in
(a), this can only occur if Harry and Sally both have CN or NC genes. That is,
$X \subseteq S_3 \cap H_3$, where $S_3 = S_1 \cap S_2$ and $H_3 = H_1 \cap H_2$. Now if Harry and Sally are
both CN or NC, these genes pass independently to the baby, and so

$$P(X \mid S_3 \cap H_3) = \frac{P(X)}{P(S_3 \cap H_3)} = \frac{1}{4}.$$

(Remember the principle that we started with!)

We are asked to find $P(X \mid S_2 \cap H_2)$, in other words (since
$X \subseteq S_3 \cap H_3 \subseteq S_2 \cap H_2$),

$$\frac{P(X)}{P(S_2 \cap H_2)}.$$

Now Harry's and Sally's genes are independent, so

$$P(S_3 \cap H_3) = P(S_3) \cdot P(H_3),$$
$$P(S_2 \cap H_2) = P(S_2) \cdot P(H_2).$$

Thus,

$$\frac{P(X)}{P(S_2 \cap H_2)} = \frac{P(X)}{P(S_3 \cap H_3)} \cdot \frac{P(S_3 \cap H_3)}{P(S_2 \cap H_2)}
  = \frac{1}{4} \cdot \frac{P(S_1 \cap S_2)}{P(S_2)} \cdot \frac{P(H_1 \cap H_2)}{P(H_2)}
  = \frac{1}{4} \cdot P(S_1 \mid S_2) \cdot P(H_1 \mid H_2)
  = \frac{1}{4} \cdot \frac{2}{3} \cdot \frac{2}{51}
  = \frac{1}{153}.$$

I thank Eduardo Mendes for pointing out a mistake in my previous solution to 
this problem. 
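
The three answers in this solution can be checked by exact arithmetic. The sketch below is an illustration only; the helper names `genotype_dist` and `carrier_given_healthy` are hypothetical, and it assumes, as in the solution, that the two genes of a person are chosen independently.

```python
from fractions import Fraction
from itertools import product

def genotype_dist(p_c_first, p_c_second):
    """Distribution over ordered gene pairs when each gene is C independently."""
    dist = {}
    for g1, g2 in product("CN", repeat=2):
        p1 = p_c_first if g1 == "C" else 1 - p_c_first
        p2 = p_c_second if g2 == "C" else 1 - p_c_second
        dist[g1 + g2] = p1 * p2
    return dist

def carrier_given_healthy(dist):
    """P(at least one C gene | not CC)."""
    healthy = sum(p for genes, p in dist.items() if genes != "CC")
    carrier = sum(p for genes, p in dist.items() if genes in ("CN", "NC"))
    return carrier / healthy

# (a) Both of Sally's parents are carriers, so each passes on C with probability 1/2.
sally = carrier_given_healthy(genotype_dist(Fraction(1, 2), Fraction(1, 2)))
# (b) Harry's genes are drawn from the population, where P(C) = 1/50.
harry = carrier_given_healthy(genotype_dist(Fraction(1, 50), Fraction(1, 50)))
# (c) The child is CC with probability 1/4 when both parents are carriers.
child = Fraction(1, 4) * sally * harry
print(sally, harry, child)
```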

Question The Land of Nod lies in the monsoon zone, and has just two seasons, 
Wet and Dry. The Wet season lasts for 1/3 of the year, and the Dry season for 2/3 
of the year. During the Wet season, the probability that it is raining is 3/4; during 
the Dry season, the probability that it is raining is 1/6. 

(a) I visit the capital city, Oneirabad, on a random day of the year. What is the
probability that it is raining when I arrive?

(b) I visit Oneirabad on a random day, and it is raining when I arrive. Given this
information, what is the probability that my visit is during the Wet season?

(c) I visit Oneirabad on a random day, and it is raining when I arrive. Given this
information, what is the probability that it will be raining when I return to
Oneirabad in a year's time?

(You may assume that in a year's time the season will be the same as today but, 
given the season, whether or not it is raining is independent of today's weather.) 

Solution (a) Let W be the event 'it is the wet season', D the event 'it is the dry
season', and R the event 'it is raining when I arrive'. We are given that $P(W) = 1/3$,
$P(D) = 2/3$, $P(R \mid W) = 3/4$, $P(R \mid D) = 1/6$. By the Theorem of Total Probability,

$$P(R) = P(R \mid W)\,P(W) + P(R \mid D)\,P(D) = (3/4) \cdot (1/3) + (1/6) \cdot (2/3) = 13/36.$$

(b) By Bayes' Theorem,

$$P(W \mid R) = \frac{P(R \mid W)\,P(W)}{P(R)} = \frac{(3/4) \cdot (1/3)}{13/36} = \frac{9}{13}.$$

(c) Let R' be the event 'it is raining in a year's time'. The information we are
given is that $P(R \cap R' \mid W) = P(R \mid W)\,P(R' \mid W)$ and similarly for D. Thus

$$P(R \cap R') = P(R \cap R' \mid W)\,P(W) + P(R \cap R' \mid D)\,P(D) = (3/4)^2 \cdot (1/3) + (1/6)^2 \cdot (2/3) = \frac{89}{432},$$

and so

$$P(R' \mid R) = \frac{P(R \cap R')}{P(R)} = \frac{89/432}{13/36} = \frac{89}{156}.$$


Chapter 3
Random variables



In this chapter we define random variables and some related concepts such as 
probability mass function, expected value, variance, and median; and look at some 
particularly important types of random variables including the binomial, Poisson, 
and normal. 

3.1 What are random variables? 

The Holy Roman Empire was, in the words of the historian Voltaire, "neither holy, 
nor Roman, nor an empire". Similarly, a random variable is neither random nor a 
variable: 

A random variable is a function defined on a sample space. 

The values of the function can be anything at all, but for us they will always be 
numbers. The standard abbreviation for 'random variable' is r.v. 

Example I select at random a student from the class and measure his or her 
height in centimetres. 

Here, the sample space is the set of students; the random variable is 'height',
which is a function from the set of students to the real numbers: $h(S)$ is the height
of student S in centimetres. (Remember that a function is nothing but a rule for
associating with each element of its domain set an element of its target or range
set. Here the domain set is the sample space $\mathcal{S}$, the set of students in the class, and
the target space is the set of real numbers.)

Example I throw a six-sided die twice; I am interested in the sum of the two
numbers. Here the sample space is

$$\mathcal{S} = \{(i, j) : 1 \le i, j \le 6\},$$

and the random variable F is given by $F(i, j) = i + j$. The target set is the set
$\{2, 3, \ldots, 12\}$.
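
The idea that a random variable is simply a function on the sample space can be made concrete. Here is a minimal sketch for the dice example; the names `sample_space`, `F` and `target_set` are chosen for illustration.

```python
from itertools import product

# Sample space: all ordered pairs (i, j) with 1 <= i, j <= 6.
sample_space = list(product(range(1, 7), repeat=2))

def F(outcome):
    """The random variable 'sum of the two numbers', a function on the sample space."""
    i, j = outcome
    return i + j

# The target set is the set of values that F actually takes.
target_set = sorted({F(outcome) for outcome in sample_space})
print(len(sample_space), target_set)
```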

The two random variables in the above examples are representatives of the two 
types of random variables that we will consider. These definitions are not quite 
precise, but more examples should make the idea clearer. 

A random variable F is discrete if the values it can take are separated by gaps. 
For example, F is discrete if it can take only finitely many values (as in the second 
example above, where the values are the integers from 2 to 12), or if the values of 
F are integers (for example, the number of nuclear decays which take place in a 
second in a sample of radioactive material - the number is an integer but we can't 
easily put an upper limit on it.) 

A random variable is continuous if there are no gaps between its possible 
values. In the first example, the height of a student could in principle be any real 
number between certain extreme limits. A random variable whose values range 
over an interval of real numbers, or even over all real numbers, is continuous. 

One could concoct random variables which are neither discrete nor continuous
(e.g. the possible values could be 1, 2, 3, or any real number between 4 and 5),
but we will not consider such random variables.

We begin by considering discrete random variables. 

3.2 Probability mass function 

Let F be a discrete random variable. The most basic question we can ask is: given
any value a in the target set of F, what is the probability that F takes the value a?
In other words, if we consider the event

$$A = \{x \in \mathcal{S} : F(x) = a\},$$

what is $P(A)$? (Remember that an event is a subset of the sample space.) Since
events of this kind are so important, we simplify the notation: we write

$$P(F = a)$$

in place of

$$P(\{x \in \mathcal{S} : F(x) = a\}).$$

(There is a fairly common convention in probability and statistics that random
variables are denoted by capital letters and their values by lower-case letters. In
fact, it is quite common to use the same letter in lower case for a value of the
random variable; thus, we would write $P(F = f)$ in the above example. But
remember that this is only a convention, and you are not bound to it.)

The probability mass function of a discrete random variable F is the function,
formula or table which gives the value of $P(F = a)$ for each element a in the target
set of F. If F takes only a few values, it is convenient to list it in a table; otherwise
we should give a formula if possible. The standard abbreviation for 'probability
mass function' is p.m.f.

Example I toss a fair coin three times. The random variable X gives the number
of heads recorded. The possible values of X are 0, 1, 2, 3, and its p.m.f. is

    a        |  0    1    2    3
    P(X = a) | 1/8  3/8  3/8  1/8

For the sample space is {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}, and
each outcome is equally likely. The event X = 1, for example, when written as a
set of outcomes, is equal to {HTT, THT, TTH}, and has probability 3/8.

Two random variables X and Y are said to have the same distribution if the
values they take and their probability mass functions are equal. We write $X \sim Y$
in this case.

In the above example, if Y is the number of tails recorded during the experiment,
then X and Y have the same distribution, even though their actual values are
different (indeed, $Y = 3 - X$).
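
A p.m.f. on a finite sample space with equally likely outcomes can be obtained by plain counting. The following sketch, with illustrative names `outcomes`, `pmf`, `heads` and `tails`, tabulates the p.m.f. of the number of heads and shows that the number of tails has the same distribution although it is a different function.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

outcomes = ["".join(t) for t in product("HT", repeat=3)]  # HHH, HHT, ..., TTT

def pmf(random_variable):
    """p.m.f. of a random variable defined on the equally likely outcomes above."""
    counts = Counter(random_variable(o) for o in outcomes)
    return {value: Fraction(c, len(outcomes)) for value, c in sorted(counts.items())}

def heads(outcome):   # the random variable X
    return outcome.count("H")

def tails(outcome):   # the random variable Y = 3 - X
    return outcome.count("T")

pmf_x, pmf_y = pmf(heads), pmf(tails)
print(pmf_x)
print(pmf_x == pmf_y)  # compares the distributions, not the functions themselves
```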

3.3 Expected value and variance 

Let X be a discrete random variable which takes the values $a_1, \ldots, a_n$. The
expected value or mean of X is the number $E(X)$ given by the formula

$$E(X) = \sum_{i=1}^{n} a_i P(X = a_i).$$

That is, we multiply each value of X by the probability that X takes that value,
and sum these terms. The expected value is a kind of 'generalised average': if
each of the values is equally likely, so that each has probability 1/n, then
$E(X) = (a_1 + \cdots + a_n)/n$, which is just the average of the values.

There is an interpretation of the expected value in terms of mechanics. If we
put a mass $p_i$ on the axis at position $a_i$ for $i = 1, \ldots, n$, where $p_i = P(X = a_i)$, then
the centre of mass of all these masses is at the point $E(X)$.

If the random variable X takes infinitely many values, say $a_1, a_2, a_3, \ldots$, then
we define the expected value of X to be the infinite sum

$$E(X) = \sum_{i=1}^{\infty} a_i P(X = a_i).$$






Of course, now we have to worry about whether this means anything, that is, 
whether this infinite series is convergent. This is a question which is discussed 
at great length in analysis. We won't worry about it too much. Usually, discrete 
random variables will only have finitely many values; in the few examples we 
consider where there are infinitely many values, the series will usually be a ge- 
ometric series or something similar, which we know how to sum. In the proofs 
below, we assume that the number of values is finite. 
The variance of X is the number $\mathrm{Var}(X)$ given by

$$\mathrm{Var}(X) = E(X^2) - E(X)^2.$$

Here, $X^2$ is just the random variable whose values are the squares of the values of
X. Thus

$$E(X^2) = \sum_{i=1}^{n} a_i^2 P(X = a_i)$$

(or an infinite sum, if necessary). The next theorem shows that, if $E(X)$ is a kind
of average of the values of X, then $\mathrm{Var}(X)$ is a measure of how spread-out the
values are around their average.

Proposition 3.1 Let X be a discrete random variable with $E(X) = \mu$. Then

$$\mathrm{Var}(X) = E((X - \mu)^2) = \sum_{i=1}^{n} (a_i - \mu)^2 P(X = a_i).$$

For the second term is equal to the third by definition, and the third is

$$\sum_{i=1}^{n} (a_i - \mu)^2 P(X = a_i)
  = \sum_{i=1}^{n} (a_i^2 - 2\mu a_i + \mu^2) P(X = a_i)
  = \left(\sum_{i=1}^{n} a_i^2 P(X = a_i)\right) - 2\mu \left(\sum_{i=1}^{n} a_i P(X = a_i)\right) + \mu^2 \left(\sum_{i=1}^{n} P(X = a_i)\right).$$

(What is happening here is that the entire sum consists of n rows with three terms
in each row. We add it up by columns instead of by rows, getting three parts with
n terms in each part.) Continuing, we find

$$E((X - \mu)^2) = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - E(X)^2,$$

and we are done. (Remember that $E(X) = \mu$, and that $\sum_{i=1}^{n} P(X = a_i) = 1$ since
the events $X = a_i$ form a partition.)

Some people take the conclusion of this proposition as the definition of variance.

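Example I toss a fair coin three times; X is the number of heads. What are the
expected value and variance of X?

$$E(X) = 0 \times (1/8) + 1 \times (3/8) + 2 \times (3/8) + 3 \times (1/8) = 3/2,$$
$$\mathrm{Var}(X) = 0^2 \times (1/8) + 1^2 \times (3/8) + 2^2 \times (3/8) + 3^2 \times (1/8) - (3/2)^2 = 3/4.$$

If we calculate the variance using Proposition 3.1, we get

$$\mathrm{Var}(X) = \left(-\tfrac{3}{2}\right)^2 \times \tfrac{1}{8} + \left(-\tfrac{1}{2}\right)^2 \times \tfrac{3}{8} + \left(\tfrac{1}{2}\right)^2 \times \tfrac{3}{8} + \left(\tfrac{3}{2}\right)^2 \times \tfrac{1}{8} = \tfrac{3}{4}.$$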



Two properties of expected value and variance can be used as a check on your 
calculations. 

• The expected value of X always lies between the smallest and largest values
of X.

• The variance of X is never negative. (For the formula in Proposition 3.1 is
a sum of terms, each of the form $(a_i - \mu)^2$ (a square, hence non-negative)
times $P(X = a_i)$ (a probability, hence non-negative).)
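
The two formulae for the variance, and the two checks just listed, can be tried out on the coin example. This is a small sketch with illustrative function names, representing a p.m.f. as a dictionary from values to exact probabilities.

```python
from fractions import Fraction

def expectation(pmf):
    """E(X): sum of value times probability."""
    return sum(a * p for a, p in pmf.items())

def variance(pmf):
    """Var(X) = E(X^2) - E(X)^2."""
    mean = expectation(pmf)
    return sum(a * a * p for a, p in pmf.items()) - mean ** 2

def variance_centred(pmf):
    """Var(X) = E((X - mu)^2), as in Proposition 3.1."""
    mean = expectation(pmf)
    return sum((a - mean) ** 2 * p for a, p in pmf.items())

# Number of heads in three tosses of a fair coin.
heads_pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
print(expectation(heads_pmf), variance(heads_pmf), variance_centred(heads_pmf))
```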



3.4 Joint p.m.f. of two random variables 

Let X be a random variable taking the values $a_1, \ldots, a_n$, and let Y be a random
variable taking the values $b_1, \ldots, b_m$. We say that X and Y are independent if, for
any possible values i and j, we have

$$P(X = a_i, Y = b_j) = P(X = a_i) \cdot P(Y = b_j).$$

Here $P(X = a_i, Y = b_j)$ means the probability of the event that X takes the value
$a_i$ and Y takes the value $b_j$. So we could re-state the definition as follows:

The random variables X and Y are independent if, for any value $a_i$ of
X and any value $b_j$ of Y, the events $X = a_i$ and $Y = b_j$ are independent
(events).

Note the difference between 'independent events' and 'independent random
variables'.

Example In Chapter 2, we saw the following: I have two red pens, one green
pen, and one blue pen. I select two pens without replacement. Then the events
'exactly one red pen selected' and 'exactly one green pen selected' turned out to
be independent. Let X be the number of red pens selected, and Y the number of
green pens selected. Then

$$P(X = 1, Y = 1) = P(X = 1) \cdot P(Y = 1).$$

Are X and Y independent random variables?

No, because $P(X = 2) = 1/6$, $P(Y = 1) = 1/2$, but $P(X = 2, Y = 1) = 0$ (it is
impossible to have two red and one green in a sample of two).

On the other hand, if I roll a die twice, and X and Y are the numbers that come 
up on the first and second throws, then X and Y will be independent, even if the 
die is not fair (so that the outcomes are not all equally likely). 

If we have more than two random variables (for example X, Y, Z), we say that
they are mutually independent if the events that the random variables take specific
values (for example, X = a, Y = b, Z = c) are mutually independent. (You may
want to revise the material on mutually independent events.)

What about the expected values of random variables? For expected value, it is 
easy, but for variance it helps if the variables are independent: 

Theorem 3.2 Let X and Y be random variables.

(a) $E(X + Y) = E(X) + E(Y)$.

(b) If X and Y are independent, then $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.

We will see the proof later.

If two random variables X and Y are not independent, then knowing the p.m.f.
of each variable does not tell the whole story. The joint probability mass function
(or joint p.m.f.) of X and Y is the table giving, for each value $a_i$ of X and each
value $b_j$ of Y, the probability that $X = a_i$ and $Y = b_j$. We arrange the table so
that the rows correspond to the values of X and the columns to the values of Y.
Note that summing the entries in the row corresponding to the value $a_i$ gives the
probability that $X = a_i$; that is, the row sums form the p.m.f. of X. Similarly the
column sums form the p.m.f. of Y. (The row and column sums are sometimes
called the marginal distributions or marginals.)

In particular, X and Y are independent r.v.s if and only if each entry of the
table is equal to the product of its row sum and its column sum.

Example I have two red pens, one green pen, and one blue pen, and I choose
two pens without replacement. Let X be the number of red pens that I choose and
Y the number of green pens. Then the joint p.m.f. of X and Y is given by the
following table:

                  Y
              0      1
         0    0     1/6
    X    1   1/3    1/3
         2   1/6     0

The row and column sums give us the p.m.f.s for X and Y:

    a        |  0    1    2
    P(X = a) | 1/6  2/3  1/6

    b        |  0    1
    P(Y = b) | 1/2  1/2
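
The joint p.m.f. table, its marginals, and the independence criterion (each entry equal to row sum times column sum) can all be produced by enumerating the six equally likely pairs of pens. The sketch below uses illustrative names; the labels "R1", "R2", "G", "B" for the pens are an assumption made only to tell the two red pens apart.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

pens = ["R1", "R2", "G", "B"]
samples = list(combinations(pens, 2))  # the 6 equally likely unordered pairs

def colour_counts(sample):
    """(number of red pens, number of green pens) in a sample."""
    reds = sum(pen.startswith("R") for pen in sample)
    greens = sum(pen == "G" for pen in sample)
    return reds, greens

counts = Counter(colour_counts(s) for s in samples)
joint = {(x, y): Fraction(counts[(x, y)], len(samples))
         for x in range(3) for y in range(2)}

# Row sums give the p.m.f. of X, column sums the p.m.f. of Y.
marginal_x = {x: sum(joint[(x, y)] for y in range(2)) for x in range(3)}
marginal_y = {y: sum(joint[(x, y)] for x in range(3)) for y in range(2)}

independent = all(joint[(x, y)] == marginal_x[x] * marginal_y[y]
                  for x in range(3) for y in range(2))
print(joint, marginal_x, marginal_y, independent)
```

Note that the single entry for X = 1, Y = 1 does factorise, which is why the two events in the earlier example were independent even though the random variables are not.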



Now we give the proof of Theorem 3.2.

We consider the joint p.m.f. of X and Y. The random variable X + Y takes the
values $a_i + b_j$ for $i = 1, \ldots, n$ and $j = 1, \ldots, m$. Now the probability that it takes
a given value $c_k$ is the sum of the probabilities $P(X = a_i, Y = b_j)$ over all i and j
such that $a_i + b_j = c_k$. Thus,

$$E(X + Y) = \sum_{k} c_k P(X + Y = c_k)
  = \sum_{i=1}^{n} \sum_{j=1}^{m} (a_i + b_j) P(X = a_i, Y = b_j)
  = \sum_{i=1}^{n} a_i \left(\sum_{j=1}^{m} P(X = a_i, Y = b_j)\right) + \sum_{j=1}^{m} b_j \left(\sum_{i=1}^{n} P(X = a_i, Y = b_j)\right).$$

Now $\sum_{j=1}^{m} P(X = a_i, Y = b_j)$ is a row sum of the joint p.m.f. table, so is equal to
$P(X = a_i)$, and similarly $\sum_{i=1}^{n} P(X = a_i, Y = b_j)$ is a column sum and is equal to
$P(Y = b_j)$. So

$$E(X + Y) = \sum_{i=1}^{n} a_i P(X = a_i) + \sum_{j=1}^{m} b_j P(Y = b_j) = E(X) + E(Y).$$



The variance is a bit trickier. First we calculate

$$E((X + Y)^2) = E(X^2 + 2XY + Y^2) = E(X^2) + 2E(XY) + E(Y^2),$$

using part (a) of the Theorem. We have to consider the term $E(XY)$. For this, we
have to make the assumption that X and Y are independent, that is,

$$P(X = a_i, Y = b_j) = P(X = a_i) \cdot P(Y = b_j).$$

As before, we have

$$E(XY) = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j P(X = a_i, Y = b_j)
  = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j P(X = a_i) P(Y = b_j)
  = \left(\sum_{i=1}^{n} a_i P(X = a_i)\right) \cdot \left(\sum_{j=1}^{m} b_j P(Y = b_j)\right)
  = E(X) \cdot E(Y).$$

So

$$\mathrm{Var}(X + Y) = E((X + Y)^2) - (E(X + Y))^2$$
$$= (E(X^2) + 2E(XY) + E(Y^2)) - (E(X)^2 + 2E(X)E(Y) + E(Y)^2)$$
$$= (E(X^2) - E(X)^2) + 2(E(XY) - E(X)E(Y)) + (E(Y^2) - E(Y)^2)$$
$$= \mathrm{Var}(X) + \mathrm{Var}(Y).$$
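
Both parts of Theorem 3.2 can be tested on small examples. The sketch below is illustrative only: the weights of the biased die are an invented assumption, while the pens table is the one from the example above. It shows additivity of variance for the independent dice and its failure for the dependent pen counts, whereas additivity of expected value holds in both cases.

```python
from fractions import Fraction

def mean(pmf):
    return sum(v * p for v, p in pmf.items())

def var(pmf):
    return sum(v * v * p for v, p in pmf.items()) - mean(pmf) ** 2

def pmf_of_sum(joint):
    """p.m.f. of X + Y from a joint p.m.f. given as {(a, b): probability}."""
    out = {}
    for (a, b), p in joint.items():
        out[a + b] = out.get(a + b, 0) + p
    return out

# Independent case: two throws of a biased die (weights are an assumption).
die = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 8),
       4: Fraction(1, 8), 5: Fraction(1, 8), 6: Fraction(1, 8)}
joint_dice = {(a, b): pa * pb for a, pa in die.items() for b, pb in die.items()}

# Dependent case: the pens example (X red pens, Y green pens), zero entries omitted.
joint_pens = {(0, 1): Fraction(1, 6), (1, 0): Fraction(1, 3),
              (1, 1): Fraction(1, 3), (2, 0): Fraction(1, 6)}
pens_x = {0: Fraction(1, 6), 1: Fraction(2, 3), 2: Fraction(1, 6)}
pens_y = {0: Fraction(1, 2), 1: Fraction(1, 2)}

print(var(pmf_of_sum(joint_dice)), 2 * var(die))
print(var(pmf_of_sum(joint_pens)), var(pens_x) + var(pens_y))
```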

To finish this section, we consider constant random variables. (If the thought 
of a 'constant variable' worries you, remember that a random variable is not a 
variable at all but a function, and there is nothing amiss with a constant function.) 

Proposition 3.3 Let C be a constant random variable with value c. Let X be any
random variable.

(a) $E(C) = c$, $\mathrm{Var}(C) = 0$.

(b) $E(X + c) = E(X) + c$, $\mathrm{Var}(X + c) = \mathrm{Var}(X)$.

(c) $E(cX) = cE(X)$, $\mathrm{Var}(cX) = c^2\,\mathrm{Var}(X)$.

Proof (a) The random variable C takes the single value c with $P(C = c) = 1$. So
$E(C) = c \cdot 1 = c$. Also,

$$\mathrm{Var}(C) = E(C^2) - E(C)^2 = c^2 - c^2 = 0.$$

(For $C^2$ is a constant random variable with value $c^2$.)

(b) This follows immediately from Theorem 3.2, once we observe that the
constant random variable C and any random variable X are independent. (This is
true because $P(X = a, C = c) = P(X = a) \cdot 1$.) Then

$$E(X + c) = E(X) + E(C) = E(X) + c,$$
$$\mathrm{Var}(X + c) = \mathrm{Var}(X) + \mathrm{Var}(C) = \mathrm{Var}(X).$$

(c) If $a_1, \ldots, a_n$ are the values of X, then $ca_1, \ldots, ca_n$ are the values of cX, and
$P(cX = ca_i) = P(X = a_i)$. So

$$E(cX) = \sum_{i=1}^{n} c a_i P(cX = ca_i) = c \sum_{i=1}^{n} a_i P(X = a_i) = cE(X).$$

Then

$$\mathrm{Var}(cX) = E(c^2X^2) - E(cX)^2 = c^2E(X^2) - (cE(X))^2 = c^2(E(X^2) - E(X)^2) = c^2\,\mathrm{Var}(X).$$
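
Parts (b) and (c) are easy to confirm numerically. The following sketch applies a shift and a scaling to the 'number of heads in three tosses' variable; the helper `transform` and the choice c = 5 are illustrative assumptions.

```python
from fractions import Fraction

def mean(pmf):
    return sum(v * p for v, p in pmf.items())

def var(pmf):
    return sum(v * v * p for v, p in pmf.items()) - mean(pmf) ** 2

def transform(pmf, f):
    """p.m.f. of f(X); assumes f is one-to-one on the values of X."""
    return {f(v): p for v, p in pmf.items()}

heads = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
c = 5
shifted = transform(heads, lambda v: v + c)  # X + c
scaled = transform(heads, lambda v: c * v)   # cX

print(mean(shifted), var(shifted))
print(mean(scaled), var(scaled))
```

Shifting moves the mean but leaves the spread alone, while scaling by c multiplies the variance by c squared.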



3.5 Some discrete random variables 

We now look at five types of discrete random variables, each depending on one or 
more parameters. We describe for each type the situations in which it arises, and 
give the p.m.f., the expected value, and the variance. If the variable is tabulated 
in the New Cambridge Statistical Tables, we give the table number, and some 
examples of using the tables. You should have a copy of the tables to follow the 
examples. 

A summary of this information is given in Appendix B. 

Before we begin, a comment on the New Cambridge Statistical Tables. They 
don't give the probability mass function (or p.m.f.), but a closely related function 
called the cumulative distribution function. It is defined for a discrete random 
variable as follows. 

Let X be a random variable taking values $a_1, a_2, \ldots, a_n$. We assume that these
are arranged in ascending order: $a_1 < a_2 < \cdots < a_n$. The cumulative distribution
function, or c.d.f., of X is given by

$$F_X(a_i) = P(X \le a_i).$$

We see that it can be expressed in terms of the p.m.f. of X as follows:

$$F_X(a_i) = P(X = a_1) + \cdots + P(X = a_i) = \sum_{j=1}^{i} P(X = a_j).$$

In the other direction, we can recover the p.m.f. from the c.d.f.:

$$P(X = a_i) = F_X(a_i) - F_X(a_{i-1}).$$
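
The passage between p.m.f. and c.d.f. is just a running sum and a difference of neighbours. Here is a minimal sketch with illustrative function names; for i = 1 it takes the value of the c.d.f. below $a_1$ to be 0, which is an assumption needed to make the difference formula work for the smallest value.

```python
from fractions import Fraction
from itertools import accumulate

def cdf_from_pmf(values, probs):
    """F_X(a_i) = P(X = a_1) + ... + P(X = a_i); values must be in ascending order."""
    return dict(zip(values, accumulate(probs)))

def pmf_from_cdf(cdf):
    """P(X = a_i) = F_X(a_i) - F_X(a_{i-1}), with the c.d.f. taken as 0 below a_1."""
    pmf, previous = {}, 0
    for value in sorted(cdf):
        pmf[value] = cdf[value] - previous
        previous = cdf[value]
    return pmf

# Number of heads in three tosses of a fair coin.
values = [0, 1, 2, 3]
probs = [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]
cdf = cdf_from_pmf(values, probs)
print(cdf)
print(pmf_from_cdf(cdf))
```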

We won't use the c.d.f. of a discrete random variable except for looking up 
the tables. It is much more important for continuous random variables! 

Bernoulli random variable Bernoulli(p)

A Bernoulli random variable is the simplest type of all. It only takes two values,
0 and 1. So its p.m.f. looks as follows:

    x        | 0  1
    P(X = x) | q  p

Here, p is the probability that X = 1; it can be any number between 0 and 1.
Necessarily q (the probability that X = 0) is equal to 1 - p. So p determines
everything.

For a Bernoulli random variable X, we sometimes describe the experiment as
a 'trial', the event X = 1 as 'success', and the event X = 0 as 'failure'.

For example, if a biased coin has probability p of coming down heads, then
the number of heads that we get when we toss the coin once is a Bernoulli(p)
random variable.

More generally, let A be any event in a probability space $\mathcal{S}$. With A, we
associate a random variable $I_A$ (remember that a random variable is just a function on
$\mathcal{S}$) by the rule

$$I_A(s) = \begin{cases} 1 & \text{if } s \in A; \\ 0 & \text{if } s \notin A. \end{cases}$$

The random variable $I_A$ is called the indicator variable of A, because its value
indicates whether or not A occurred. It is a Bernoulli(p) random variable, where
$p = P(A)$. (The event $I_A = 1$ is just the event A.) Some people write $1_A$ instead of
$I_A$.

Calculation of the expected value and variance of a Bernoulli random variable
is easy. Let $X \sim \mathrm{Bernoulli}(p)$. (Remember that $\sim$ means "has the same p.m.f.
as".)

$$E(X) = 0 \cdot q + 1 \cdot p = p;$$
$$\mathrm{Var}(X) = 0^2 \cdot q + 1^2 \cdot p - p^2 = p - p^2 = pq.$$

(Remember that q = 1 - p.)

Binomial random variable Bin(n, p)

Remember that for a Bernoulli random variable, we describe the event X = 1 as a
'success'. Now a binomial random variable counts the number of successes in n
independent trials each associated with a Bernoulli(p) random variable.

For example, suppose that we have a biased coin for which the probability of
heads is p. We toss the coin n times and count the number of heads obtained. This
number is a Bin(n, p) random variable.

A Bin(n, p) random variable X takes the values $0, 1, 2, \ldots, n$, and the p.m.f. of
X is given by

$$P(X = k) = {}^{n}C_k\, q^{n-k} p^k$$

for $k = 0, 1, 2, \ldots, n$, where q = 1 - p. This is because there are ${}^{n}C_k$ different ways
of obtaining k heads in a sequence of n throws (the number of choices of the k
positions in which the heads occur), and the probability of getting k heads and
n - k tails in a particular order is $q^{n-k} p^k$.

Note that we have given a formula rather than a table here. For small values
we could tabulate the results; for example, for Bin(4, p):

    k        |  0     1       2        3     4
    P(X = k) | q^4  4q^3p  6q^2p^2  4qp^3  p^4

Note: when we add up all the probabilities in the table, we get

$$\sum_{k=0}^{n} {}^{n}C_k\, q^{n-k} p^k = (q + p)^n = 1,$$

as it should be: here we used the binomial theorem

$$(x + y)^n = \sum_{k=0}^{n} {}^{n}C_k\, x^{n-k} y^k.$$

(This argument explains the name of the binomial random variable!)

If $X \sim \mathrm{Bin}(n, p)$, then

$$E(X) = np, \qquad \mathrm{Var}(X) = npq.$$

There are two ways to prove this, an easy way and a harder way. The easy way 
only works for the binomial, but the harder way is useful for many random vari- 
ables. However, you can skip it if you wish: I have set it in smaller type for this 
reason. 

Here is the easy method. We have a coin with probability p of coming down
heads, and we toss it n times and count the number X of heads. Then X is our
Bin(n, p) random variable. Let $X_k$ be the random variable defined by

$$X_k = \begin{cases} 1 & \text{if we get heads on the } k\text{th toss,} \\ 0 & \text{if we get tails on the } k\text{th toss.} \end{cases}$$

In other words, $X_k$ is the indicator variable of the event 'heads on the kth toss'.
Now we have

$$X = X_1 + X_2 + \cdots + X_n$$

(can you see why?), and $X_1, \ldots, X_n$ are independent Bernoulli(p) random variables
(since they are defined by different tosses of a coin). So, as we saw earlier, $E(X_i) = p$,
$\mathrm{Var}(X_i) = pq$. Then, by Theorem 3.2, since the variables are independent, we
have

$$E(X) = p + p + \cdots + p = np,$$
$$\mathrm{Var}(X) = pq + pq + \cdots + pq = npq.$$
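
The formulae E(X) = np and Var(X) = npq can also be checked directly from the p.m.f. The sketch below uses exact fractions; the parameter values n = 10 and p = 3/10 are arbitrary choices for illustration.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """Dictionary k -> P(X = k) for X ~ Bin(n, p)."""
    q = 1 - p
    return {k: comb(n, k) * q ** (n - k) * p ** k for k in range(n + 1)}

n, p = 10, Fraction(3, 10)
pmf = binomial_pmf(n, p)
mean = sum(k * pk for k, pk in pmf.items())
variance = sum(k * k * pk for k, pk in pmf.items()) - mean ** 2

print(sum(pmf.values()))       # the probabilities add up to 1
print(mean, n * p)             # compare with np
print(variance, n * p * (1 - p))  # compare with npq
```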

The other method uses a gadget called the probability generating function. We only use it
here for calculating expected values and variances, but if you learn more probability theory you
will see other uses for it. Let X be a random variable whose values are non-negative integers. (We
don't insist that it takes all possible values; this method is fine for the binomial Bin(n, p), which
takes values between 0 and n.) To save space, we write $p_k$ for the probability $P(X = k)$. Now the
probability generating function of X is the power series

$$G_X(x) = \sum p_k x^k.$$

(The sum is over all values k taken by X.)

We use the notation $[F(x)]_{x=1}$ for the result of substituting x = 1 in the series F(x).

Proposition 3.4 Let $G_X(x)$ be the probability generating function of a random variable X. Then

(a) $[G_X(x)]_{x=1} = 1$;

(b) $E(X) = \left[\frac{d}{dx} G_X(x)\right]_{x=1}$;

(c) $\mathrm{Var}(X) = \left[\frac{d^2}{dx^2} G_X(x)\right]_{x=1} + E(X) - E(X)^2$.

Part (a) is just the statement that probabilities add up to 1: when we substitute x = 1 in the
power series for $G_X(x)$ we just get $\sum p_k$.

For part (b), when we differentiate the series term-by-term (you will learn later in Analysis
that this is OK), we get

$$\frac{d}{dx} G_X(x) = \sum k p_k x^{k-1}.$$

Now putting x = 1 in this series we get

$$\sum k p_k = E(X).$$

For part (c), differentiating twice gives

$$\frac{d^2}{dx^2} G_X(x) = \sum k(k-1) p_k x^{k-2}.$$

Now putting x = 1 in this series we get

$$\sum k(k-1) p_k = \sum k^2 p_k - \sum k p_k = E(X^2) - E(X).$$

Adding $E(X)$ and subtracting $E(X)^2$ gives $E(X^2) - E(X)^2$, which by definition is $\mathrm{Var}(X)$.

Now let us apply this to the binomial random variable $X \sim \mathrm{Bin}(n, p)$. We have

$$p_k = P(X = k) = {}^{n}C_k\, q^{n-k} p^k,$$

so the probability generating function is

$$G_X(x) = \sum_{k=0}^{n} {}^{n}C_k\, q^{n-k} p^k x^k = (q + px)^n,$$

by the Binomial Theorem. Putting x = 1 gives $(q + p)^n = 1$, in agreement with Proposition 3.4(a).
Differentiating once, using the Chain Rule, we get $np(q + px)^{n-1}$. Putting x = 1 we find that

$$E(X) = np.$$

Differentiating again, we get $n(n-1)p^2(q + px)^{n-2}$. Putting x = 1 gives $n(n-1)p^2$. Now adding
$E(X) - E(X)^2$, we get

$$\mathrm{Var}(X) = n(n-1)p^2 + np - n^2p^2 = np - np^2 = npq.$$
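
When X takes only finitely many values, its probability generating function is a polynomial, and Proposition 3.4 can be applied mechanically to the list of coefficients. The following is a sketch under that assumption; the helper names and the parameters n = 6, p = 1/3 are illustrative.

```python
from fractions import Fraction
from math import comb

def derivative(coeffs):
    """Coefficients of the derivative of the polynomial sum(coeffs[k] * x**k)."""
    return [k * c for k, c in enumerate(coeffs)][1:]

def at_one(coeffs):
    """Value of the polynomial at x = 1, that is, the sum of its coefficients."""
    return sum(coeffs)

def mean_and_variance(pgf):
    """Apply Proposition 3.4(b) and (c) to a p.g.f. given by its coefficients."""
    first = at_one(derivative(pgf))
    second = at_one(derivative(derivative(pgf)))
    return first, second + first - first ** 2

n, p = 6, Fraction(1, 3)
q = 1 - p
pgf = [comb(n, k) * q ** (n - k) * p ** k for k in range(n + 1)]  # pgf[k] = p_k

print(at_one(pgf))
print(mean_and_variance(pgf), (n * p, n * p * q))
```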

The binomial random variable is tabulated in Table 1 of the Cambridge Statis- 
tical Tables [1]. As explained earlier, the tables give the cumulative distribution 
function. 

For example, suppose that the probability that a certain coin comes down
heads is 0.45. If the coin is tossed 15 times, what is the probability of five or
fewer heads? Turning to the page n = 15 in Table 1 and looking at the row 0.45,
you read off the answer 0.2608. What is the probability of exactly five heads? This
is P(5 or fewer) - P(4 or fewer), and from tables the answer is 0.2608 - 0.1204 =
0.1404.

The tables only go up to p = 0.5. For larger values of p, use the fact that the
number of failures in Bin(n, p) has the same distribution as the number of successes
in Bin(n, 1 - p). So the probability of five heads in 15 tosses of a coin with p = 0.55
is the probability of ten tails, where tails has probability 0.45, namely
0.9745 - 0.9231 = 0.0514.
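
If the tables are not to hand, the same numbers can be computed from the p.m.f. This is a rough sketch in floating-point arithmetic, with illustrative function names; it is meant as a cross-check on the table look-ups, not as a replacement for them.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * (1 - p) ** (n - k) * p ** k

def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p), the quantity given in the tables."""
    return sum(binomial_pmf(j, n, p) for j in range(k + 1))

print(round(binomial_cdf(5, 15, 0.45), 4))
print(round(binomial_pmf(5, 15, 0.45), 4))
# Five heads with p = 0.55 is the same event as ten tails, tails having probability 0.45.
print(binomial_pmf(5, 15, 0.55), binomial_pmf(10, 15, 0.45))
```

A value computed directly can differ from the table-based answer in the last decimal place, because the latter is a difference of two entries that have each been rounded to four places.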

Another interpretation of the binomial random variable concerns sampling.
Suppose that we have N balls in a box, of which M are red. We sample n balls
from the box with replacement; let the random variable X be the number of red
balls in the sample. What is the distribution of X? Since each ball has probability
M/N of being red, and different choices are independent, $X \sim \mathrm{Bin}(n, p)$, where
p = M/N is the proportion of red balls in the box.

What about sampling without replacement? This leads us to our next random 
variable: 

Hypergeometric random variable Hg(n, M, N)

Suppose that we have N balls in a box, of which M are red. We sample n balls
from the box without replacement. Let the random variable X be the number of
red balls in the sample. Such an X is called a hypergeometric random variable
Hg(n, M, N).

The random variable X can take any of the values $0, 1, 2, \ldots, n$. Its p.m.f. is
given by the formula

$$P(X = k) = \frac{{}^{M}C_k \cdot {}^{N-M}C_{n-k}}{{}^{N}C_n}.$$

For the number of samples of n balls from N is ${}^{N}C_n$; the number of ways of
choosing k of the M red balls and n - k of the N - M others is ${}^{M}C_k \cdot {}^{N-M}C_{n-k}$; and
all choices are equally likely.

The expected value and variance of a hypergeometric random variable are as
follows (we won't go into the proofs):

$$E(X) = n\left(\frac{M}{N}\right), \qquad \mathrm{Var}(X) = n\left(\frac{M}{N}\right)\left(\frac{N-M}{N}\right)\left(\frac{N-n}{N-1}\right).$$

You should compare these to the values for a binomial random variable. If we
let p = M/N be the proportion of red balls in the box, then $E(X) = np$, and $\mathrm{Var}(X)$
is equal to npq multiplied by a 'correction factor' $(N-n)/(N-1)$.

In particular, if the numbers M and N - M of red and non-red balls in the
box are both very large compared to the size n of the sample, then the difference
between sampling with and without replacement is very small, and indeed the
'correction factor' is close to 1. So we can say that Hg(n, M, N) is approximately
Bin(n, M/N) if n is small compared to M and N - M.

Consider our example of choosing two pens from four, where two pens are
red, one green, and one blue. The number X of red pens is a Hg(2, 2, 4) random
variable. We calculated earlier that P(X = 0) = 1/6, P(X = 1) = 2/3 and
P(X = 2) = 1/6. From this we find by direct calculation that E(X) = 1 and
Var(X) = 1/3. These agree with the formulae above.
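
The pen example can also be checked mechanically. The following is a minimal sketch in Python (the function name `hypergeom_pmf` is our own choice, not part of the notes); it uses exact fractions so that the values can be compared directly with the p.m.f. formula and the stated expected value and variance.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(k, n, M, N):
    """P(X = k) for X ~ Hg(n, M, N): k red balls in a sample of n drawn
    without replacement from N balls, of which M are red."""
    return Fraction(comb(M, k) * comb(N - M, n - k), comb(N, n))

# The pens example: Hg(2, 2, 4).
n, M, N = 2, 2, 4
pmf = {k: hypergeom_pmf(k, n, M, N) for k in range(n + 1)}

mean = sum(k * p for k, p in pmf.items())
var = sum(k * k * p for k, p in pmf.items()) - mean ** 2

# Closed forms quoted in the text.
mean_formula = Fraction(n * M, N)
var_formula = Fraction(n * M * (N - M) * (N - n), N * N * (N - 1))

print(pmf, mean, var)
```

Both routes give the same mean and variance, which is the agreement claimed in the text.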



Geometric random variable Geom(p)

The geometric random variable is like the binomial but with a different stopping
rule. We have again a coin whose probability of heads is p. Now, instead of
tossing it a fixed number of times and counting the heads, we toss it until it comes
down heads for the first time, and count the number of times we have tossed
the coin. Thus, the values of the variable are the positive integers 1, 2, 3, ... (In
theory we might never get a head and toss the coin infinitely often, but if p > 0
this possibility is 'infinitely unlikely', i.e. has probability zero, as we will see.)
We always assume that 0 < p < 1.

More generally, the number of independent Bernoulli trials required until the
first success is obtained is a geometric random variable.

The p.m.f. of a Geom(p) random variable is given by

$$P(X = k) = q^{k-1}p,$$

where q = 1 − p. For the event X = k means that we get tails on the first k − 1
tosses and heads on the kth, and this event has probability $q^{k-1}p$, since 'tails' has
probability q and different tosses are independent.

Let's add up these probabilities:

$$\sum_{k=1}^{\infty} q^{k-1}p = p + qp + q^2p + \cdots = \frac{p}{1-q} = 1,$$

since the series is a geometric progression with first term p and common ratio
q, where q < 1. (Just as the binomial theorem shows that probabilities sum to 1
for a binomial random variable, and gives its name to the random variable, so the
geometric progression does for the geometric random variable.)

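We calculate the expected value and the variance using the probability generating
function. If X ~ Geom(p), the result will be that

$$E(X) = 1/p, \qquad \mathrm{Var}(X) = q/p^2.$$

We have

$$G_X(x) = \sum_{k=1}^{\infty} q^{k-1}p\,x^k = \frac{px}{1-qx},$$

again by summing a geometric progression. Differentiating, we get

$$\frac{d}{dx}G_X(x) = \frac{(1-qx)p + pxq}{(1-qx)^2} = \frac{p}{(1-qx)^2}.$$

Putting x = 1, we obtain

$$E(X) = \frac{p}{(1-q)^2} = \frac{1}{p}.$$

Differentiating again gives $2pq/(1-qx)^3$, so

$$\mathrm{Var}(X) = \frac{2pq}{p^3} + \frac{1}{p} - \frac{1}{p^2} = \frac{q}{p^2}.$$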
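
As a sanity check on these formulae, one can sum the series numerically. The sketch below is illustrative only: the function name `geometric_moments` and the truncation point are our own assumptions, and the series is cut off after enough terms that the neglected tail is negligible.

```python
def geometric_moments(p, terms=2000):
    """Approximate the total probability, mean and variance of Geom(p)
    by summing the first `terms` terms of the defining series."""
    q = 1.0 - p
    probs = [(k, q ** (k - 1) * p) for k in range(1, terms + 1)]
    total = sum(pr for _, pr in probs)
    mean = sum(k * pr for k, pr in probs)
    var = sum(k * k * pr for k, pr in probs) - mean ** 2
    return total, mean, var

p = 0.25
total, mean, var = geometric_moments(p)
print(total, mean, var)          # compare with 1, 1/p and q/p^2
print(1 / p, (1 - p) / p ** 2)
```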

For example, if we toss a fair coin until heads is obtained, the expected number
of tosses until the first head is 2 (so the expected number of tails is 1); and the
variance of this number is also 2.

Poisson random variable Poisson(λ)

The Poisson random variable, unlike the ones we have seen before, is very closely
connected with continuous things.

Suppose that 'incidents' occur at random times, but at a steady rate overall.
The best example is radioactive decay: atomic nuclei decay randomly, but the
average number λ which will decay in a given interval is constant. The Poisson
random variable X counts the number of 'incidents' which occur in a given interval.
So if, on average, there are 2.4 nuclear decays per second, then the number of
decays in one second starting now is a Poisson(2.4) random variable.

Another example might be the number of telephone calls a minute to a busy
telephone number.

Although we will not prove it, the p.m.f. for a Poisson(λ) variable X is given
by the formula

$$P(X = k) = \frac{\lambda^k}{k!}e^{-\lambda}.$$

Let's check that these probabilities add up to one. We get

$$\sum_{k=0}^{\infty} \frac{\lambda^k}{k!}e^{-\lambda} = \left(\sum_{k=0}^{\infty} \frac{\lambda^k}{k!}\right)e^{-\lambda} = e^{\lambda}\cdot e^{-\lambda} = 1,$$

since the expression in brackets is the sum of the exponential series.

By analogy with what happened for the binomial and geometric random variables,
you might have expected that this random variable would be called 'exponential'.
Unfortunately, this name has been given to a closely-related continuous
random variable which we will meet later. However, if you speak a little French,
you might use as a mnemonic the fact that if I go fishing, and the fish are biting at
the rate of λ per hour on average, then the number of fish I will catch in the next
hour is a Poisson(λ) random variable.

The expected value and variance of a Poisson(λ) random variable X are given
by

$$E(X) = \mathrm{Var}(X) = \lambda.$$

Again we use the probability generating function. If X ~ Poisson(λ), then

$$G_X(x) = \sum_{k=0}^{\infty} \frac{(\lambda x)^k}{k!}e^{-\lambda} = e^{\lambda(x-1)},$$

again using the series for the exponential function.

Differentiation gives $\lambda e^{\lambda(x-1)}$, so E(X) = λ. Differentiating again gives
$\lambda^2 e^{\lambda(x-1)}$, so

$$\mathrm{Var}(X) = \lambda^2 + \lambda - \lambda^2 = \lambda.$$

The cumulative distribution function of a Poisson random variable is tabulated
in Table 2 of the New Cambridge Statistical Tables. So, for example, we find from
the tables that, if 2.4 fish bite per hour on average, then the probability that I will
catch no fish in the next hour is 0.0907, while the probability that I catch five or
fewer is 0.9643 (so that the probability that I catch six or more is 0.0357).

There is another situation in which the Poisson distribution arises. Suppose I 
am looking for some very rare event which only occurs once in 1000 trials on av- 
erage. So I conduct 1000 independent trials. How many occurrences of the event 
do I see? This number is really a binomial random variable Bin(1000, 1/1000). 
But it turns out to be Poisson(1), to a very good approximation. So, for example,
the probability that the event doesn't occur is about 1/e.

The general rule is:

If n is large, p is small, and np = λ, then Bin(n, p) can be approximated
by Poisson(λ).
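
The tabulated Poisson values and the quality of the approximation can be reproduced with a few lines of Python. This is a sketch only; the helper names `poisson_pmf` and `binomial_pmf` are our own.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam ** k / factorial(k) * exp(-lam)

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# The fishing example with 2.4 bites per hour on average.
p_no_fish = poisson_pmf(0, 2.4)
p_at_most_five = sum(poisson_pmf(k, 2.4) for k in range(6))

# Largest pointwise gap between Bin(1000, 1/1000) and Poisson(1) for k = 0..10.
max_gap = max(
    abs(binomial_pmf(k, 1000, 1 / 1000) - poisson_pmf(k, 1.0))
    for k in range(11)
)

print(round(p_no_fish, 4), round(p_at_most_five, 4), max_gap)
```

The gap between the two p.m.f.s is small at every value checked, which is what the general rule asserts for large n and small p.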

3.6 Continuous random variables 

We haven't so far really explained what a continuous random variable is. Its target
set is the set of real numbers, or perhaps the non-negative real numbers or just an
interval. The crucial property is that, for any real number a, we have P(X = a) = 0;
that is, the probability that the height of a random student, or the time I have to
wait for a bus, is precisely a, is zero. So we can't use the probability mass function
for continuous random variables; it would always be zero and give no information.

We use the cumulative distribution function or c.d.f. instead. Remember from
last week that the c.d.f. of the random variable X is the function F_X defined by

$$F_X(x) = P(X \le x).$$

Note: The name of the function is F_X; the lower case x refers to the argument
of the function, the number which is substituted into the function. It is common
but not universal to use as the argument the lower-case version of the name of the
random variable, as here. Note that F_X(y) is the same function written in terms of
the variable y instead of x, whereas F_Y(x) is the c.d.f. of the random variable Y
(which might be a completely different function).

Now let X be a continuous random variable. Then, since the probability that
X takes the precise value x is zero, there is no difference between P(X ≤ x) and
P(X < x).

Proposition 3.5 The c.d.f. is an increasing function (this means that F_X(x) ≤
F_X(y) if x < y), and approaches the limits 0 as x → −∞ and 1 as x → ∞.

The function is increasing because, if x < y, then

$$F_X(y) - F_X(x) = P(X \le y) - P(X \le x) = P(x < X \le y) \ge 0.$$

Also F_X(∞) = 1 because X must certainly take some finite value; and F_X(−∞) = 0
because no value is smaller than −∞!

Another important function is the probability density function f_X. It is obtained
by differentiating the c.d.f.:

$$f_X(x) = \frac{d}{dx}F_X(x).$$

Now f_X(x) is non-negative, since it is the derivative of an increasing function. If
we know f_X(x), then F_X is obtained by integrating. Because F_X(−∞) = 0, we
have

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt.$$

Note the use of the "dummy variable" t in this integral. Note also that

$$P(a \le X \le b) = F_X(b) - F_X(a) = \int_a^b f_X(t)\,dt.$$

You can think of the p.d.f. like this: the probability that the value of X lies in a
very small interval from x to x + h is approximately f_X(x) · h. So, although the
probability of getting exactly the value x is zero, the probability of being close to
x is proportional to f_X(x).

There is a mechanical analogy which you may find helpful. Remember that
we modelled a discrete random variable X by placing at each value a of X a mass
equal to P(X = a). Then the total mass is one, and the expected value of X is
the centre of mass. For a continuous random variable, imagine instead a wire of
variable thickness, so that the density of the wire (mass per unit length) at the
point x is equal to f_X(x). Then again the total mass is one; the mass to the left of
x is F_X(x); and again it will hold that the centre of mass is at E(X).

Most facts about continuous random variables are obtained by replacing the
p.m.f. by the p.d.f. and replacing sums by integrals. Thus, the expected value of
X is given by

$$E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx,$$

and the variance is (as before)

$$\mathrm{Var}(X) = E(X^2) - E(X)^2,$$

where

$$E(X^2) = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx.$$

It is also true that Var(X) = E((X − μ)²), where μ = E(X).






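We will see examples of these calculations shortly. But here is a small example
to show the ideas. The support of a continuous random variable is the smallest
interval containing all values of x where f_X(x) > 0.

Suppose that the random variable X has p.d.f. given by

$$f_X(x) = \begin{cases} 2x & \text{if } 0 \le x \le 1, \\ 0 & \text{otherwise.} \end{cases}$$

The support of X is the interval [0, 1]. We check the integral:

$$\int_{-\infty}^{\infty} f_X(x)\,dx = \int_0^1 2x\,dx = \left[x^2\right]_{x=0}^{x=1} = 1.$$

The cumulative distribution function of X is

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \begin{cases} 0 & \text{if } x < 0, \\ x^2 & \text{if } 0 \le x \le 1, \\ 1 & \text{if } x > 1. \end{cases}$$

(Study this carefully to see how it works.) We have

$$E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_0^1 2x^2\,dx = \frac{2}{3},$$

$$E(X^2) = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx = \int_0^1 2x^3\,dx = \frac{1}{2},$$

$$\mathrm{Var}(X) = \frac{1}{2} - \left(\frac{2}{3}\right)^2 = \frac{1}{18}.$$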
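
The integrals in this example are easy to do by hand, but the same recipe (integrate the p.d.f. against 1, x and x²) can be carried out numerically for densities that are less convenient. The following sketch uses a simple midpoint rule; the helper `integrate` and the step count are our own illustrative choices, not a method prescribed by the notes.

```python
def integrate(g, a, b, steps=10_000):
    """Midpoint-rule approximation to the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

def pdf(x):
    """The example density: 2x on [0, 1], zero elsewhere."""
    return 2 * x if 0 <= x <= 1 else 0.0

# It is enough to integrate over the support [0, 1].
total = integrate(pdf, 0, 1)
mean = integrate(lambda x: x * pdf(x), 0, 1)
second_moment = integrate(lambda x: x * x * pdf(x), 0, 1)
variance = second_moment - mean ** 2

print(total, mean, second_moment, variance)
```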

3.7 Median, quartiles, percentiles 

Another measure commonly used for continuous random variables is the median;
this is the value m such that "half of the distribution lies to the left of m and half to
the right". More formally, m should satisfy F_X(m) = 1/2. It is not the same as the
mean or expected value. In the example at the end of the last section, we saw that
E(X) = 2/3. The median of X is the value of m for which F_X(m) = 1/2. Since
F_X(x) = x² for 0 ≤ x ≤ 1, we see that m = 1/√2.

If there is a value m such that the graph of y = f_X(x) is symmetric about x = m,
then both the expected value and the median of X are equal to m.

The lower quartile l and the upper quartile u are similarly defined by

$$F_X(l) = 1/4, \qquad F_X(u) = 3/4.$$

Thus, the probability that X lies between l and u is 3/4 − 1/4 = 1/2, so the
quartiles give an estimate of how spread-out the distribution is. More generally, we
define the nth percentile of X to be the value of x_n such that

$$F_X(x_n) = n/100,$$

that is, the probability that X is smaller than x_n is n%.

Reminder If the c.d.f. of X is F_X(x) and the p.d.f. is f_X(x), then

• differentiate F_X to get f_X, and integrate f_X to get F_X;

• use f_X to calculate E(X) and Var(X);

• use F_X to calculate P(a ≤ X ≤ b) (this is F_X(b) − F_X(a)), and the median
and percentiles of X.
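
When the equation F_X(x) = c cannot be solved by hand, the median and quartiles can be found numerically. The sketch below uses bisection, which is one possible technique and not something the notes prescribe; it assumes the c.d.f. is continuous and increasing on the interval searched. The function name `quantile` is our own.

```python
def quantile(cdf, prob, lo, hi, tol=1e-12):
    """Solve cdf(x) = prob by bisection on [lo, hi], assuming cdf is
    continuous and increasing there with cdf(lo) <= prob <= cdf(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def cdf(x):
    """C.d.f. of the example with density 2x on [0, 1], restricted to [0, 1]."""
    return x * x

median = quantile(cdf, 0.50, 0.0, 1.0)
lower = quantile(cdf, 0.25, 0.0, 1.0)
upper = quantile(cdf, 0.75, 0.0, 1.0)

print(lower, median, upper)
```

For this example the exact answers are l = 1/2, m = 1/√2 and u = √3/2, so the output can be checked against them.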

3.8 Some continuous random variables 

In this section we examine three important continuous random variables: the uni- 
form, exponential, and normal. The details are summarised in Appendix B. 

Uniform random variable U(a, b)

Let a and b be real numbers with a < b. A uniform random variable on the interval
[a, b] is, roughly speaking, "equally likely to be anywhere in the interval". In other
words, its probability density function is constant on the interval [a, b] (and zero
outside the interval). What should the constant value c be? The integral of the
p.d.f. is the area of a rectangle of height c and base b − a; this must be 1, so
c = 1/(b − a). Thus, the p.d.f. of the random variable X ~ U(a, b) is given by

$$f_X(x) = \begin{cases} 1/(b-a) & \text{if } a \le x \le b, \\ 0 & \text{otherwise.} \end{cases}$$

By integration, we find that the c.d.f. is

$$F_X(x) = \begin{cases} 0 & \text{if } x < a, \\ (x-a)/(b-a) & \text{if } a \le x \le b, \\ 1 & \text{if } x > b. \end{cases}$$

Further calculation (or the symmetry of the p.d.f.) shows that the expected value
and the median of X are both given by (a + b)/2 (the midpoint of the interval),
while Var(X) = (b − a)²/12.

The uniform random variable doesn't really arise in practical situations. However,
it is very useful for simulations. Most computer systems include a random
number generator, which apparently produces independent values of a uniform
random variable on the interval [0, 1]. Of course, they are not really random, since
the computer is a deterministic machine; but there should be no obvious pattern to
the numbers produced, and in a large number of trials they should be distributed
uniformly over the interval.

You will learn in the Statistics course how to use a uniform random variable 
to construct values of other types of discrete or continuous random variables. Its 
great simplicity makes it the best choice for this purpose. 



Exponential random variable Exp(λ)

The exponential random variable arises in the same situation as the Poisson: be
careful not to confuse them! We have events which occur randomly but at a constant
average rate of λ per unit time (e.g. radioactive decays, fish biting). The
Poisson random variable, which is discrete, counts how many events will occur
in the next unit of time. The exponential random variable, which is continuous,
measures exactly how long from now it is until the next event occurs. Note that it
takes non-negative real numbers as values.

If X ~ Exp(λ), the p.d.f. of X is

$$f_X(x) = \begin{cases} 0 & \text{if } x < 0, \\ \lambda e^{-\lambda x} & \text{if } x \ge 0. \end{cases}$$

By integration, we find the c.d.f. to be

$$F_X(x) = \begin{cases} 0 & \text{if } x < 0, \\ 1 - e^{-\lambda x} & \text{if } x \ge 0. \end{cases}$$

Further calculation gives

$$E(X) = 1/\lambda, \qquad \mathrm{Var}(X) = 1/\lambda^2.$$

The median m satisfies $1 - e^{-\lambda m} = 1/2$, so that m = log 2/λ. (The logarithm is to
base e, so that log 2 = 0.69314718056 approximately.)
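
One standard way to turn uniform random numbers into exponential ones is to solve F_X(x) = u for x, which gives x = −log(1 − u)/λ. This "inverse transform" method is not described in these notes, so treat the sketch below as an illustration under that assumption; the function name, seed and sample size are arbitrary choices of ours.

```python
import math
import random

def sample_exponential(lam, rng):
    """Draw one Exp(lam) value by inverting the c.d.f. 1 - exp(-lam * x)."""
    u = rng.random()              # uniform on [0, 1)
    return -math.log(1.0 - u) / lam

rng = random.Random(2024)         # fixed seed so the run is repeatable
lam = 2.0
samples = sorted(sample_exponential(lam, rng) for _ in range(100_000))

sample_mean = sum(samples) / len(samples)
sample_median = samples[len(samples) // 2]

print(sample_mean, 1 / lam)                # sample mean against 1/lambda
print(sample_median, math.log(2) / lam)    # sample median against log 2 / lambda
```

With a large sample, the sample mean and median should sit close to 1/λ and log 2/λ respectively.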




Normal random variable N(μ, σ²)

The normal random variable is the commonest of all in applications, and the most 
important. There is a theorem called the central limit theorem which says that, for 
virtually any random variable X which is not too bizarre, if you take the sum (or 
the average) of n independent random variables with the same distribution as X, 
the result will be approximately normal, and will become more and more like a 
normal variable as n grows. This partly explains why a random variable affected 
by many independent factors, like a man's height, has an approximately normal 
distribution.

More precisely, if n is large, then a Bin(n, p) random variable is well approximated
by a normal random variable with the same expected value np and the
same variance npq. (If you are approximating any discrete random variable by a
continuous one, you should make a "continuity correction" - see the next section
for details and an example.)

The p.d.f. of the random variable X ~ N(μ, σ²) is given by the formula

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x-\mu)^2/2\sigma^2}.$$

We have E(X) = μ and Var(X) = σ². The picture below shows the graph of this
function for μ = 0, the familiar 'bell-shaped curve'.

[Figure: graph of the normal p.d.f. with μ = 0.]



The c.d.f. of X is obtained as usual by integrating the p.d.f. However, it is not
possible to write the integral of this function (which, stripped of its constants, is
$e^{-x^2}$) in terms of 'standard' functions. So there is no alternative but to make tables
of its values.

The crucial fact that means that we don't have to tabulate the function for all
values of μ and σ is the following:

Proposition 3.6 If X ~ N(μ, σ²), and Y = (X − μ)/σ, then Y ~ N(0, 1).

So we only need tables of the c.d.f. for N(0, 1) - this is the so-called standard
normal random variable - and we can find the c.d.f. of any normal random variable.
The c.d.f. of the standard normal is given in Table 4 of the New Cambridge
Statistical Tables [1]. The function is called Φ in the tables.

For example, suppose that X ~ N(6, 25). What is the probability that X ≤ 8?
Putting Y = (X − 6)/5, so that Y ~ N(0, 1), we find that X ≤ 8 if and only if
Y ≤ (8 − 6)/5 = 0.4. From the tables, the probability of this is Φ(0.4) = 0.6554.
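
On a computer the table lookup can be replaced by the error function, using the identity Φ(z) = (1 + erf(z/√2))/2. The sketch below applies it to the N(6, 25) example; the function names are our own, and the code standardises exactly as in Proposition 3.6.

```python
from math import erf, sqrt

def standard_normal_cdf(z):
    """Phi(z), the c.d.f. of N(0, 1), via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), by standardising first."""
    return standard_normal_cdf((x - mu) / sigma)

# X ~ N(6, 25), so sigma = 5.
p = normal_cdf(8, 6, 5)
print(round(p, 4))
```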

The p.d.f. of a standard normal r.v. Y is symmetric about zero. This means
that, for any positive number c,

$$\Phi(-c) = P(Y \le -c) = P(Y \ge c) = 1 - P(Y \le c) = 1 - \Phi(c).$$

So it is only necessary to tabulate the function for positive values of its argument.
So, if X ~ N(6, 25) and Y = (X − 6)/5 as before, then

$$P(X \le 3) = P(Y \le -0.6) = 1 - P(Y \le 0.6) = 1 - 0.7257 = 0.2743.$$

3.9 On using tables

We end this section with a few comments about using tables, not tied particularly 
to the normal distribution (though most of the examples will come from there). 

Interpolation 

Any table is limited in the number of entries it contains. Tabulating something 
with the input given to one extra decimal place would make the table ten times as 
bulky! Interpolation can be used to extend the range of values tabulated. 

Suppose that some function F is tabulated with the input given to two places
of decimals. It is probably true that F is changing at a roughly constant rate
between, say, 0.28 and 0.29. So F(0.283) will be about three-tenths of the way
between F(0.28) and F(0.29).

For example, if Φ is the c.d.f. of the normal distribution, then Φ(0.28) =
0.6103 and Φ(0.29) = 0.6141, so Φ(0.283) ≈ 0.6114. (Three-tenths of 0.0038 is
0.0011.)
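
Linear interpolation is a one-line formula, and the same formula run backwards finds the input that produces a given table value. The sketch below is illustrative (the function names are ours); it uses the four-figure values Φ(0.28) = 0.6103, Φ(0.29) = 0.6141, Φ(0.67) = 0.7486 and Φ(0.68) = 0.7517 from the standard normal table.

```python
def interpolate(x0, f0, x1, f1, x):
    """Estimate F(x) from tabulated F(x0) = f0 and F(x1) = f1,
    assuming F changes at a constant rate between x0 and x1."""
    return f0 + (x - x0) / (x1 - x0) * (f1 - f0)

def inverse_interpolate(x0, f0, x1, f1, c):
    """Estimate the x with F(x) = c, where f0 <= c <= f1."""
    return x0 + (c - f0) / (f1 - f0) * (x1 - x0)

phi_0283 = interpolate(0.28, 0.6103, 0.29, 0.6141, 0.283)
upper_quartile = inverse_interpolate(0.67, 0.7486, 0.68, 0.7517, 0.75)

print(round(phi_0283, 4), round(upper_quartile, 4))
```

The second call is the table used in reverse: it looks for the point where the standard normal c.d.f. reaches 0.75, which is the upper quartile.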

Using tables in reverse 

This means, if you have a table of values of F, use it to find x such that F(x) is a
given value c. Usually, c won't be in the table and we have to interpolate between
values x1 and x2, where F(x1) is just less than c and F(x2) is just greater.

For example, if Φ is the c.d.f. of the normal distribution, and we want the
upper quartile, then we find from tables Φ(0.67) = 0.7486 and Φ(0.68) = 0.7517,
so the required value is about 0.6745 (since 0.0014/0.0031 = 0.45).

In this case, the percentile points of the standard normal r.v. are given in Table
5 of the New Cambridge Statistical Tables [1], so you don't need to do this. But
you will find it necessary in other cases.

Continuity correction 

Suppose we know that a discrete random variable X is well approximated by a
continuous random variable Y. We are given a table of the c.d.f. of Y and want to
find information about X. For example, suppose that X takes integer values and
we want to find P(a ≤ X ≤ b), where a and b are integers. This probability is
equal to

$$P(X = a) + P(X = a+1) + \cdots + P(X = b).$$

To say that X can be approximated by Y means that, for example, P(X = a) is
approximately equal to f_Y(a), where f_Y is the p.d.f. of Y. This is equal to the area
of a rectangle of height f_Y(a) and base 1 (from a − 0.5 to a + 0.5). This in turn is,
to a good approximation, the area under the curve y = f_Y(x) from x = a − 0.5 to
x = a + 0.5, since the pieces of the curve above and below the rectangle on either
side of x = a will approximately cancel. Similarly for the other values.

Adding all these pieces, we find that P(a ≤ X ≤ b) is approximately equal to
the area under the curve y = f_Y(x) from x = a − 0.5 to x = b + 0.5. This area is
given by F_Y(b + 0.5) − F_Y(a − 0.5), since F_Y is the integral of f_Y. Said otherwise,
this is P(a − 0.5 ≤ Y ≤ b + 0.5).

We summarise the continuity correction:

Suppose that the discrete random variable X, taking integer values, is
approximated by the continuous random variable Y. Then

$$P(a \le X \le b) \approx P(a - 0.5 \le Y \le b + 0.5) = F_Y(b + 0.5) - F_Y(a - 0.5).$$

(Here, ≈ means "approximately equal".) Similarly, for example, P(X ≤ b) ≈
P(Y ≤ b + 0.5), and P(X ≥ a) ≈ P(Y ≥ a − 0.5).

Example The probability that a light bulb will fail in a year is 0.75, and light
bulbs fail independently. If 192 bulbs are installed, what is the probability that the
number which fail in a year lies between 140 and 150 inclusive?

Solution Let X be the number of light bulbs which fail in a year. Then X ~
Bin(192, 3/4), and so E(X) = 144, Var(X) = 36. So X is approximated by Y ~
N(144, 36), and

$$P(140 \le X \le 150) \approx P(139.5 \le Y \le 150.5)$$

by the continuity correction.

Let Z = (Y − 144)/6. Then Z ~ N(0, 1), and

$$\begin{aligned}
P(139.5 \le Y \le 150.5) &= P\left(\frac{139.5 - 144}{6} \le Z \le \frac{150.5 - 144}{6}\right) \\
&= P(-0.75 \le Z \le 1.083) \\
&= 0.8606 - 0.2266 \quad \text{(from tables)} \\
&= 0.6340.
\end{aligned}$$


3.10 Worked examples 

Question I roll a fair die twice. Let the random variable X be the maximum of 
the two numbers obtained, and let Y be the modulus of their difference (that is, 
the value of Y is the larger number minus the smaller number). 

(a) Write down the joint p.m.f. of (X, Y).

(b) Write down the p.m.f. of X, and calculate its expected value and its variance. 

(c) Write down the p.m.f. of Y, and calculate its expected value and its variance. 

(d) Are the random variables X and Y independent? 



Solution (a)

                              Y
               0      1      2      3      4      5
         1   1/36     0      0      0      0      0
         2   1/36   2/36     0      0      0      0
    X    3   1/36   2/36   2/36     0      0      0
         4   1/36   2/36   2/36   2/36     0      0
         5   1/36   2/36   2/36   2/36   2/36     0
         6   1/36   2/36   2/36   2/36   2/36   2/36

The best way to produce this is to write out a 6 x 6 table giving all possible values
for the two throws, work out for each cell what the values of X and Y are, and
then count the number of occurrences of each pair. For example: X = 5, Y = 2
can occur in two ways: the numbers thrown must be (5, 3) or (3, 5).

(b) Take row sums:

    x            1      2      3      4      5      6
    P(X = x)   1/36   3/36   5/36   7/36   9/36   11/36

Hence in the usual way

$$E(X) = \frac{161}{36}, \qquad \mathrm{Var}(X) = \frac{2555}{1296}.$$

(c) Take column sums:

    y            0      1       2      3      4      5
    P(Y = y)   6/36   10/36   8/36   6/36   4/36   2/36

and so

$$E(Y) = \frac{35}{18}, \qquad \mathrm{Var}(Y) = \frac{665}{324}.$$

(d) No: e.g. P(X = 1, Y = 2) = 0 but P(X = 1) · P(Y = 2) = 8/1296.
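
The table-building procedure described in part (a) is exactly what a short program does. The sketch below enumerates the 36 equally likely outcomes with exact fractions; the helper names `marginal` and `mean_var` are our own.

```python
from fractions import Fraction
from itertools import product

# Joint p.m.f. of X = maximum and Y = absolute difference of two fair dice.
joint = {}
for first, second in product(range(1, 7), repeat=2):
    key = (max(first, second), abs(first - second))
    joint[key] = joint.get(key, 0) + Fraction(1, 36)

def marginal(joint, index):
    """Sum the joint p.m.f. over the other variable (index 0 for X, 1 for Y)."""
    out = {}
    for key, pr in joint.items():
        out[key[index]] = out.get(key[index], 0) + pr
    return out

def mean_var(pmf):
    """Expected value and variance of a p.m.f. given as {value: probability}."""
    mean = sum(v * pr for v, pr in pmf.items())
    return mean, sum(v * v * pr for v, pr in pmf.items()) - mean ** 2

pmf_x, pmf_y = marginal(joint, 0), marginal(joint, 1)
mean_x, var_x = mean_var(pmf_x)
mean_y, var_y = mean_var(pmf_y)

# X and Y are independent only if every cell equals the product of marginals.
independent = all(
    joint.get((x, y), 0) == pmf_x[x] * pmf_y[y] for x in pmf_x for y in pmf_y
)

print(mean_x, var_x, mean_y, var_y, independent)
```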



Question An archer shoots an arrow at a target. The distance of the arrow from
the centre of the target is a random variable X whose p.d.f. is given by

$$f_X(x) = \begin{cases} (3 + 2x - x^2)/9 & \text{if } 0 \le x \le 3, \\ 0 & \text{if } x > 3. \end{cases}$$

The archer's score is determined as follows:

    Distance   X < 0.5   0.5 ≤ X < 1   1 ≤ X < 1.5   1.5 ≤ X < 2   X ≥ 2
    Score        10           7             4              1          0

Construct the probability mass function for the archer's score, and find the archer's
expected score.

Solution First we work out the probability of the arrow being in each of the
given bands:

$$P(X < 0.5) = F_X(0.5) - F_X(0) = \int_0^{0.5} \frac{3 + 2x - x^2}{9}\,dx = \left[\frac{9x + 3x^2 - x^3}{27}\right]_0^{1/2} = \frac{41}{216}.$$

Similarly we find that P(0.5 ≤ X < 1) = 47/216, P(1 ≤ X < 1.5) = 47/216,
P(1.5 ≤ X < 2) = 41/216, and P(X ≥ 2) = 40/216. So the p.m.f. for the archer's
score S is

    s            0        1        4        7        10
    P(S = s)   40/216   41/216   47/216   47/216   41/216





Hence

$$E(S) = \frac{41 \cdot 1 + 47 \cdot 4 + 47 \cdot 7 + 41 \cdot 10}{216} = \frac{121}{27}.$$
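
The band probabilities are all differences of the c.d.f., so they are quick to verify with exact arithmetic. The sketch below assumes the c.d.f. F_X(x) = (9x + 3x² − x³)/27 on [0, 3], obtained by integrating the density used in the solution; the variable names are our own.

```python
from fractions import Fraction

def cdf(x):
    """F_X(x) = (9x + 3x^2 - x^3)/27 for 0 <= x <= 3."""
    x = Fraction(x)
    return (9 * x + 3 * x ** 2 - x ** 3) / 27

# (lower edge, upper edge, score) for each band of the target.
bands = [
    (Fraction(0), Fraction(1, 2), 10),
    (Fraction(1, 2), Fraction(1), 7),
    (Fraction(1), Fraction(3, 2), 4),
    (Fraction(3, 2), Fraction(2), 1),
    (Fraction(2), Fraction(3), 0),
]

score_pmf = {score: cdf(hi) - cdf(lo) for lo, hi, score in bands}
expected_score = sum(score * pr for score, pr in score_pmf.items())

print(score_pmf, expected_score)
```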



Question Let T be the lifetime in years of new bus engines. Suppose that T is
continuous with probability density function

$$f_T(x) = \begin{cases} 0 & \text{for } x < 1, \\ \dfrac{d}{x^3} & \text{for } x > 1 \end{cases}$$

for some constant d.

(a) Find the value of d.

(b) Find the mean and median of T.

(c) Suppose that 240 new bus engines are installed at the same time, and that
their lifetimes are independent. By making an appropriate approximation,
find the probability that at most 10 of the engines last for 4 years or more.

Solution (a) The integral of f_T(x), over the support of T, must be 1. That is,

$$1 = \int_1^{\infty} \frac{d}{x^3}\,dx = \left[\frac{-d}{2x^2}\right]_1^{\infty} = d/2,$$

so d = 2.

(b) The c.d.f. of T is obtained by integrating the p.d.f.; that is, it is

$$F_T(x) = \begin{cases} 0 & \text{for } x < 1, \\ 1 - \dfrac{1}{x^2} & \text{for } x \ge 1. \end{cases}$$

The mean of T is

$$\int_1^{\infty} x f_T(x)\,dx = \int_1^{\infty} \frac{2}{x^2}\,dx = 2.$$

The median is the value m such that F_T(m) = 1/2. That is, 1 − 1/m² = 1/2,
or m = √2.

(c) The probability that an engine lasts for four years or more is

$$1 - F_T(4) = 1 - \left(1 - \left(\frac{1}{4}\right)^2\right) = \frac{1}{16}.$$

So, if 240 engines are installed, the number which last for four years or more
is a binomial random variable X ~ Bin(240, 1/16), with expected value
240 × (1/16) = 15 and variance 240 × (1/16) × (15/16) = 225/16.

We approximate X by Y ~ N(15, (15/4)²). Using the continuity correction,
P(X ≤ 10) ≈ P(Y ≤ 10.5).

Now, if Z = (Y − 15)/(15/4), then Z ~ N(0, 1), and

$$\begin{aligned}
P(Y \le 10.5) &= P(Z \le -1.2) \\
&= 1 - P(Z \le 1.2) \\
&= 0.1151
\end{aligned}$$

using the table of the standard normal distribution.

Note that we start with the continuous random variable T, move to the discrete
random variable X, and then move on to the continuous random variables Y and
Z, where finally Z is standard normal and so is in the tables.

A true story The answer to the question at the end of the last chapter: As the
students in the class obviously knew, the class included a pair of twins! (The twins
were Leo and Willy Moser, who both had successful careers as mathematicians.)

But what went wrong with our argument for the Birthday Paradox? We assumed
(without saying so) that the birthdays of the people in the room were independent;
but of course the birthdays of twins are clearly not independent!



Chapter 4 

More on joint distribution 



We have seen the joint p.m.f. of two discrete random variables X and Y, and we 
have learned what it means for X and Y to be independent. Now we examine 
this further to see measures of non-independence and conditional distributions of 
random variables. 

4.1 Covariance and correlation 

In this section we consider a pair of discrete random variables X and Y. Remember
that X and Y are independent if

$$P(X = a_i, Y = b_j) = P(X = a_i) \cdot P(Y = b_j)$$

holds for any pair (a_i, b_j) of values of X and Y. We introduce a number (called
the covariance of X and Y) which gives a measure of how far they are from being
independent.

Look back at the proof of Theorem 21(b), where we showed that if X and Y
are independent then Var(X + Y) = Var(X) + Var(Y). We found that, in any case,

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2(E(XY) - E(X)E(Y)),$$

and then proved that if X and Y are independent then E(XY) = E(X)E(Y), so that
the last term is zero.

Now we define the covariance of X and Y to be E(XY) − E(X)E(Y). We
write Cov(X, Y) for this quantity. Then the argument we had earlier shows the
following:

Theorem 4.1 (a) Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
(b) If X and Y are independent, then Cov(X, Y) = 0.

In fact, a more general version of (a), proved by the same argument, says that

$$\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y). \tag{4.1}$$

Another quantity closely related to covariance is the correlation coefficient,
corr(X, Y), which is just a "normalised" version of the covariance. It is defined as
follows:

$$\mathrm{corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}.$$

The point of this is the first part of the following theorem.

Theorem 4.2 Let X and Y be random variables. Then

(a) −1 ≤ corr(X, Y) ≤ 1;

(b) if X and Y are independent, then corr(X, Y) = 0;

(c) if Y = mX + c for some constants m ≠ 0 and c, then corr(X, Y) = 1 if m > 0,
and corr(X, Y) = −1 if m < 0.

The proof of the first part is optional: see the end of this section. But note that
this is another check on your calculations: if you calculate a correlation coefficient
which is bigger than 1 or smaller than −1, then you have made a mistake. Part (b)
follows immediately from part (b) of the preceding theorem.

For part (c), suppose that Y = mX + c. Let E(X) = μ and Var(X) = α, so that
E(X²) = μ² + α. Now we just calculate everything in sight.

$$\begin{aligned}
E(Y) &= E(mX + c) = mE(X) + c = m\mu + c \\
E(Y^2) &= E(m^2X^2 + 2mcX + c^2) = m^2(\mu^2 + \alpha) + 2mc\mu + c^2 \\
\mathrm{Var}(Y) &= E(Y^2) - E(Y)^2 = m^2\alpha \\
E(XY) &= E(mX^2 + cX) = m(\mu^2 + \alpha) + c\mu \\
\mathrm{Cov}(X, Y) &= E(XY) - E(X)E(Y) = m\alpha \\
\mathrm{corr}(X, Y) &= \mathrm{Cov}(X, Y)/\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)} = m\alpha/\sqrt{m^2\alpha^2} \\
&= \begin{cases} +1 & \text{if } m > 0, \\ -1 & \text{if } m < 0. \end{cases}
\end{aligned}$$

Thus the correlation coefficient is a measure of the extent to which the two
variables are related. It is +1 if Y increases linearly with X; 0 if there is no relation
between them; and −1 if Y decreases linearly as X increases. More generally, a
positive correlation indicates a tendency for larger X values to be associated with
larger Y values; a negative value, for smaller X values to be associated with larger
Y values.



Example I have two red pens, one green pen, and one blue pen, and I choose
two pens without replacement. Let X be the number of red pens that I choose and
Y the number of green pens. Then the joint p.m.f. of X and Y is given by the
following table:

                  Y
               0      1
         0     0     1/6
    X    1    1/3    1/3
         2    1/6     0

From this we can calculate the marginal p.m.f. of X and of Y and hence find
their expected values and variances:

$$E(X) = 1, \quad \mathrm{Var}(X) = 1/3, \qquad E(Y) = 1/2, \quad \mathrm{Var}(Y) = 1/4.$$

Also, E(XY) = 1/3, since the sum

$$E(XY) = \sum_{i,j} a_i b_j P(X = a_i, Y = b_j)$$

contains only one term where all three factors are non-zero. Hence

$$\mathrm{Cov}(X, Y) = 1/3 - 1/2 = -1/6,$$

and

$$\mathrm{corr}(X, Y) = \frac{-1/6}{\sqrt{1/12}} = -\frac{1}{\sqrt{3}}.$$

The negative correlation means that small values of X tend to be associated with
larger values of Y. Indeed, if X = 0 then Y must be 1, and if X = 2 then Y must
be 0, but if X = 1 then Y can be either 0 or 1.

Example We have seen that if X and Y are independent then Cov(X, Y) = 0.
However, it doesn't work the other way around. Consider the following joint
p.m.f.

                     Y
              −1     0      1
        −1    1/5    0     1/5
    X    0     0    1/5     0
         1    1/5    0     1/5

Now calculation shows that E(X) = E(Y) = E(XY) = 0, so Cov(X, Y) = 0. But
X and Y are not independent: for P(X = −1) = 2/5, P(Y = 0) = 1/5, but
P(X = −1, Y = 0) = 0.

We call two random variables X and Y uncorrelated if Cov(X, Y) = 0 (in other
words, if corr(X, Y) = 0). So we can say:

Independent random variables are uncorrelated, but uncorrelated ran- 
dom variables need not be independent. 
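
Both examples can be reproduced by a short program working directly from the joint p.m.f. tables. This is a sketch with helper names of our own choosing (`expect`, `covariance`, `correlation`, `independent`); each joint p.m.f. is stored as a dictionary mapping (x, y) to its probability, with zero cells omitted.

```python
from fractions import Fraction as F
from math import sqrt

def expect(joint, g):
    """E(g(X, Y)) for a joint p.m.f. given as {(x, y): probability}."""
    return sum(g(x, y) * pr for (x, y), pr in joint.items())

def covariance(joint):
    """Cov(X, Y) = E(XY) - E(X)E(Y)."""
    return (expect(joint, lambda x, y: x * y)
            - expect(joint, lambda x, y: x) * expect(joint, lambda x, y: y))

def correlation(joint):
    """corr(X, Y) = Cov(X, Y) / sqrt(Var(X) Var(Y)), as a float."""
    var_x = expect(joint, lambda x, y: x * x) - expect(joint, lambda x, y: x) ** 2
    var_y = expect(joint, lambda x, y: y * y) - expect(joint, lambda x, y: y) ** 2
    return float(covariance(joint)) / sqrt(float(var_x * var_y))

def independent(joint):
    """True if every cell equals the product of the two marginals."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    px = {a: sum(pr for (x, _), pr in joint.items() if x == a) for a in xs}
    py = {b: sum(pr for (_, y), pr in joint.items() if y == b) for b in ys}
    return all(joint.get((a, b), 0) == px[a] * py[b] for a in xs for b in ys)

# Pens example: X = number of red pens, Y = number of green pens.
pens = {(0, 1): F(1, 6), (1, 0): F(1, 3), (1, 1): F(1, 3), (2, 0): F(1, 6)}
# Second example: five equally likely points.
cross = {pt: F(1, 5) for pt in [(-1, -1), (-1, 1), (0, 0), (1, -1), (1, 1)]}

print(covariance(pens), correlation(pens))
print(covariance(cross), independent(cross))
```

The second line of output illustrates the boxed statement: the covariance is zero although the independence check fails.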

Here is the proof that the correlation coefficient lies between −1 and 1. Clearly this is exactly
equivalent to proving that its square is at most 1, that is, that

$$\mathrm{Cov}(X, Y)^2 \le \mathrm{Var}(X) \cdot \mathrm{Var}(Y).$$

This depends on the following fact:

Let p, q, r be real numbers with p > 0. Suppose that px² + 2qx + r ≥ 0 for all real
numbers x. Then q² ≤ pr.

For, when we plot the graph y = px² + 2qx + r, we get a parabola; the hypothesis means that this
parabola never goes below the x-axis, so that either it lies entirely above the axis, or it touches it
in one point. This means that the quadratic equation px² + 2qx + r = 0 either has no real roots, or
has two equal real roots. From high-school algebra, we know that this means that q² ≤ pr.

Now let p = Var(X), q = Cov(X, Y), and r = Var(Y). Equation (4.1) shows that

$$px^2 + 2qx + r = \mathrm{Var}(xX + Y).$$

(Note that x is an arbitrary real number here and has no connection with the random variable X!)

Since the variance of a random variable is never negative, we see that px² + 2qx + r ≥ 0 for all
choices of x. Now our argument above shows that q² ≤ pr, that is, Cov(X, Y)² ≤ Var(X) · Var(Y),
as required.



4.2 Conditional random variables 

Remember that the conditional probability of event B given event A is P(B | A) =
P(A ∩ B)/P(A).

Suppose that X is a discrete random variable. Then the conditional probability
that X takes a certain value a_i, given A, is just

$$P(X = a_i \mid A) = \frac{P(A \text{ holds and } X = a_i)}{P(A)}.$$

This defines the probability mass function of the conditional random variable
X | A.

So we can, for example, talk about the conditional expectation

$$E(X \mid A) = \sum_i a_i P(X = a_i \mid A).$$

Now the event A might itself be defined by a random variable; for example, A
might be the event that Y takes the value b_j. In this case, we have

$$P(X = a_i \mid Y = b_j) = \frac{P(X = a_i, Y = b_j)}{P(Y = b_j)}.$$

In other words, we have taken the column of the joint p.m.f. table of X and Y
corresponding to the value Y = b_j. The sum of the entries in this column is just
P(Y = b_j), the marginal distribution of Y. We divide the entries in the column by
this value to obtain a new distribution of X (whose probabilities add up to 1).

In particular, we have

$$E(X \mid Y = b_j) = \sum_i a_i P(X = a_i \mid Y = b_j).$$



Example I have two red pens, one green pen, and one blue pen, and I choose
two pens without replacement. Let $X$ be the number of red pens that I choose and
$Y$ the number of green pens. Then the joint p.m.f. of $X$ and $Y$ is given by the
following table:

                  Y
              0       1
         0    0      1/6
    X    1   1/3     1/3
         2   1/6      0


In this case, the conditional distributions of $X$ corresponding to the two values
of $Y$ are as follows:

    a                     0      1      2
    P(X = a | Y = 0)      0     2/3    1/3

    a                     0      1      2
    P(X = a | Y = 1)     1/3    2/3     0

We have

\[ E(X \mid Y = 0) = \frac{4}{3}, \qquad E(X \mid Y = 1) = \frac{2}{3}. \]



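If we know the conditional expectation of $X$ for all values of $Y$, we can find
the expected value of $X$:

Proposition 4.3
\[ E(X) = \sum_j E(X \mid Y = b_j) P(Y = b_j). \]

Proof:

\begin{align*}
E(X) &= \sum_i a_i P(X = a_i) \\
     &= \sum_i a_i \sum_j P(X = a_i \mid Y = b_j) P(Y = b_j) \\
     &= \sum_j \Bigl( \sum_i a_i P(X = a_i \mid Y = b_j) \Bigr) P(Y = b_j) \\
     &= \sum_j E(X \mid Y = b_j) P(Y = b_j).
\end{align*}

(The second line uses the Theorem of Total Probability to expand $P(X = a_i)$.)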

In the above example, we have 

\begin{align*}
E(X) &= E(X \mid Y = 0) P(Y = 0) + E(X \mid Y = 1) P(Y = 1) \\
     &= (4/3) \times (1/2) + (2/3) \times (1/2) \\
     &= 1.
\end{align*}
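
The pens example is small enough to check by brute force. The following is a minimal sketch in Python, assuming that each of the six pairs of pens is equally likely; the names `joint_pmf`, `marginal_y` and `cond_expectation` are illustrative and not part of the course material.

```python
from fractions import Fraction
from itertools import combinations

# Two red pens, one green, one blue; every pair of pens is assumed equally likely.
pens = ["R1", "R2", "G", "B"]
pairs = list(combinations(pens, 2))

def joint_pmf():
    """Return {(x, y): probability}; x = number of red pens, y = number of green pens."""
    pmf = {}
    for pair in pairs:
        x = sum(p.startswith("R") for p in pair)
        y = sum(p == "G" for p in pair)
        pmf[(x, y)] = pmf.get((x, y), Fraction(0)) + Fraction(1, len(pairs))
    return pmf

def marginal_y(pmf, b):
    """P(Y = b): the sum of the column of the joint table for Y = b."""
    return sum(p for (x, y), p in pmf.items() if y == b)

def cond_expectation(pmf, b):
    """E(X | Y = b) = sum_a a * P(X = a, Y = b) / P(Y = b)."""
    return sum(x * p for (x, y), p in pmf.items() if y == b) / marginal_y(pmf, b)

pmf = joint_pmf()
# Proposition 4.3: E(X) = sum_b E(X | Y = b) P(Y = b)
total = sum(cond_expectation(pmf, b) * marginal_y(pmf, b) for b in (0, 1))
print(cond_expectation(pmf, 0), cond_expectation(pmf, 1), total)
```

Exact fractions are used so that the output can be compared directly with the values 4/3, 2/3 and 1 in the text.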



Example Let us revisit the geometric random variable and calculate its expected 
value. Recall the situation: I have a coin with probability p of showing heads; I 
toss it repeatedly until heads appears for the first time; X is the number of tosses. 

Let $Y$ be the Bernoulli random variable whose value is 1 if the result of the
first toss is heads, 0 if it is tails. If $Y = 1$, then we stop the experiment then and
there; so if $Y = 1$, then necessarily $X = 1$, and we have $E(X \mid Y = 1) = 1$. On
the other hand, if $Y = 0$, then the sequence of tosses from that point on has the
same distribution as the original experiment; so $E(X \mid Y = 0) = 1 + E(X)$ (the 1
counting the first toss). So, writing $q = 1 - p$,

\begin{align*}
E(X) &= E(X \mid Y = 0) P(Y = 0) + E(X \mid Y = 1) P(Y = 1) \\
     &= (1 + E(X)) \cdot q + 1 \cdot p \\
     &= E(X)(1 - p) + 1;
\end{align*}

rearranging this equation, we find that $E(X) = 1/p$, confirming our earlier value.
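
As a sanity check, the value $1/p$ can be compared with a partial sum of the defining series $\sum_k k q^{k-1} p$. This is only a sketch: the function names and the number of terms are assumptions chosen for illustration.

```python
def geometric_mean_series(p, terms=10_000):
    """Partial sum of sum_k k * q**(k-1) * p, the definition of E(X)."""
    q = 1.0 - p
    return sum(k * q ** (k - 1) * p for k in range(1, terms + 1))

def geometric_mean_conditioning(p):
    """Solve E = (1 + E) * q + 1 * p for E, the equation found by conditioning on the first toss."""
    q = 1.0 - p
    # E - q*E = q + p = 1, so E = 1 / (1 - q) = 1 / p
    return 1.0 / (1.0 - q)

for p in (0.5, 0.2, 0.05):
    print(p, geometric_mean_series(p), geometric_mean_conditioning(p))
```

The two columns should agree to many decimal places, since the neglected tail of the series is tiny for these values of $p$.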

In Proposition 2.1, we saw that independence of events can be characterised
in terms of conditional probabilities: $A$ and $B$ are independent if and only if they
satisfy $P(A \mid B) = P(A)$. A similar result holds for independence of random
variables:

Proposition 4.4 Let $X$ and $Y$ be discrete random variables. Then $X$ and $Y$ are
independent if and only if, for any values $a_i$ and $b_j$ of $X$ and $Y$ respectively, we
have

\[ P(X = a_i \mid Y = b_j) = P(X = a_i). \]

This is obtained by applying Proposition 2.1 to the events $X = a_i$ and $Y = b_j$.
It can be stated in the following way: $X$ and $Y$ are independent if the conditional
p.m.f. of $X \mid (Y = b_j)$ is equal to the p.m.f. of $X$, for any value $b_j$ of $Y$.



4.3 Joint distribution of continuous r.v.s

For continuous random variables, the covariance and correlation can be defined 
by the same formulae as in the discrete case; and Equation (4.1) remains valid. 
But we have to examine what is meant by independence for continuous random 
variables. The formalism here needs even more concepts from calculus than we 
have used before: functions of two variables, partial derivatives, double integrals. 
I assume that this is unfamiliar to you, so this section will be brief and can mostly 
be skipped. 

Let $X$ and $Y$ be continuous random variables. The joint cumulative distribution
function of $X$ and $Y$ is the function $F_{X,Y}$ of two real variables given by

\[ F_{X,Y}(x,y) = P(X \le x, Y \le y). \]

We define $X$ and $Y$ to be independent if $P(X \le x, Y \le y) = P(X \le x) \cdot P(Y \le y)$,
for any $x$ and $y$, that is, $F_{X,Y}(x,y) = F_X(x) \cdot F_Y(y)$. (Note that, just as in the
one-variable case, $X$ is part of the name of the function, while $x$ is the argument of the
function.)

The joint probability density function of $X$ and $Y$ is

\[ f_{X,Y}(x,y) = \frac{\partial^2}{\partial x\,\partial y} F_{X,Y}(x,y). \]

In other words, differentiate with respect to x keeping y constant, and then differ- 
entiate with respect to y keeping x constant (or the other way round: the answer is 
the same for all functions we consider.) 

The probability that the pair of values of $(X,Y)$ corresponds to a point in some
region of the plane is obtained by taking the double integral of $f_{X,Y}$ over that
region. For example,

\[ P(a \le X \le b,\ c \le Y \le d) = \int_c^d \int_a^b f_{X,Y}(x,y)\,dx\,dy \]

(the right hand side means, integrate with respect to $x$ between $a$ and $b$ keeping $y$
fixed; the result is a function of $y$; integrate this function with respect to $y$ from $c$
to $d$.)

The marginal p.d.f. of $X$ is given by

\[ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy, \]

and the marginal p.d.f. of $Y$ is similarly

\[ f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx. \]

Then the conditional p.d.f. of $X \mid (Y = b)$ is

\[ f_{X \mid (Y=b)}(x) = \frac{f_{X,Y}(x,b)}{f_Y(b)}. \]

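The expected value of $XY$ is, not surprisingly,

\[ E(XY) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\, f_{X,Y}(x,y)\,dx\,dy, \]

and then as in the discrete case

\[ \mathrm{Cov}(X,Y) = E(XY) - E(X)E(Y), \qquad \mathrm{corr}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}. \]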

Finally, and importantly:

The continuous random variables $X$ and $Y$ are independent if and only
if

\[ f_{X,Y}(x,y) = f_X(x) \cdot f_Y(y). \]

As usual this holds if and only if the conditional p.d.f. of $X \mid (Y = b)$ is equal to
the marginal p.d.f. of $X$, for any value $b$. Also, if $X$ and $Y$ are independent, then
$\mathrm{Cov}(X,Y) = \mathrm{corr}(X,Y) = 0$ (but not conversely!).
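
To see these formulae in action, here is a rough numerical sketch in Python. The joint density $f_{X,Y}(x,y) = x + y$ on the unit square is a hypothetical example chosen for illustration (it does not appear in the notes), and the double integrals are approximated by a simple midpoint rule.

```python
def integrate2d(func, n=200):
    """Midpoint-rule approximation to the double integral of func over [0,1] x [0,1]."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(n):
            y = (j + 0.5) * h
            total += func(x, y)
    return total * h * h

def f(x, y):
    """Hypothetical joint p.d.f. on the unit square (zero elsewhere)."""
    return x + y

mass = integrate2d(f)                              # total probability, close to 1
e_x = integrate2d(lambda x, y: x * f(x, y))        # E(X)
e_y = integrate2d(lambda x, y: y * f(x, y))        # E(Y)
e_xy = integrate2d(lambda x, y: x * y * f(x, y))   # E(XY)
cov = e_xy - e_x * e_y                             # Cov(X,Y) = E(XY) - E(X)E(Y)
print(mass, e_x, e_xy, cov)
```

For this density the exact values are $E(X) = E(Y) = 7/12$, $E(XY) = 1/3$ and $\mathrm{Cov}(X,Y) = -1/144$. Since the covariance is not zero, $X$ and $Y$ cannot be independent here.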



4.4 Transformation of random variables 

If a continuous random variable $Y$ is a function of another r.v. $X$, we can find the
distribution of $Y$ in terms of that of $X$.

Example Let $X$ and $Y$ be random variables. Suppose that $X \sim U[0,4]$ (uniform
on $[0,4]$) and $Y = \sqrt{X}$. What is the support of $Y$? Find the cumulative distribution
function and the probability density function of $Y$.

Solution (a) The support of $X$ is $[0,4]$, and $Y = \sqrt{X}$, so the support of $Y$ is $[0,2]$.
(b) We have $F_X(x) = x/4$ for $0 \le x \le 4$. Now

\begin{align*}
F_Y(y) &= P(Y \le y) \\
       &= P(X \le y^2) \\
       &= F_X(y^2) \\
       &= y^2/4
\end{align*}

for $0 \le y \le 2$; of course $F_Y(y) = 0$ for $y < 0$ and $F_Y(y) = 1$ for $y > 2$. (Note that
$Y \le y$ if and only if $X \le y^2$, since $Y = \sqrt{X}$.)

(c) We have

\[ f_Y(y) = \frac{d}{dy} F_Y(y) = \begin{cases} y/2 & \text{if } 0 < y < 2, \\ 0 & \text{otherwise.} \end{cases} \]



The argument in (b) is the key. If we know $Y$ as a function of $X$, say $Y = g(X)$,
where $g$ is an increasing function, then the event $Y \le y$ is the same as the event
$X \le h(y)$, where $h$ is the inverse function of $g$. This means that $y = g(x)$ if and
only if $x = h(y)$. (In our example, $g(x) = \sqrt{x}$, and so $h(y) = y^2$.) Thus

\[ F_Y(y) = F_X(h(y)), \]

and so, by the Chain Rule,

\[ f_Y(y) = f_X(h(y))\,h'(y), \]

where $h'$ is the derivative of $h$. (This is because $f_X(x)$ is the derivative of $F_X(x)$
with respect to its argument $x$, and the Chain Rule says that if $x = h(y)$ we must
multiply by $h'(y)$ to find the derivative with respect to $y$.)
Applying this formula in our example we have

\[ f_Y(y) = \frac{1}{4} \cdot 2y = \frac{y}{2} \]

for $0 \le y \le 2$, since the p.d.f. of $X$ is $f_X(x) = 1/4$ for $0 \le x \le 4$.
Here is a formal statement of the result. 



Theorem 4.5 Let $X$ be a continuous random variable. Let $g$ be a real function
which is either strictly increasing or strictly decreasing on the support of $X$, and
which is differentiable there. Let $Y = g(X)$. Then

(a) the support of $Y$ is the image of the support of $X$ under $g$;

(b) the p.d.f. of $Y$ is given by $f_Y(y) = f_X(h(y))\,|h'(y)|$, where $h$ is the inverse
function of $g$.
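
The formula in part (b) is easy to try out numerically. The sketch below, in Python, applies it to the earlier example ($X$ uniform on $[0,4]$, $Y = \sqrt{X}$); the helper name `transformed_pdf` and the use of a central difference for $h'$ are assumptions made for illustration.

```python
def transformed_pdf(f_x, h, y, eps=1e-6):
    """f_Y(y) = f_X(h(y)) * |h'(y)|, with h' estimated by a central difference."""
    h_prime = (h(y + eps) - h(y - eps)) / (2 * eps)
    return f_x(h(y)) * abs(h_prime)

def f_x(x):
    """p.d.f. of U[0, 4]."""
    return 0.25 if 0.0 <= x <= 4.0 else 0.0

def h(y):
    """Inverse of g(x) = sqrt(x) on the support of X."""
    return y * y

for y in (0.5, 1.0, 1.5):
    # second column: formula of the theorem; third column: y/2 from the worked example
    print(y, transformed_pdf(f_x, h, y), y / 2)
```

The absolute value matters only when $g$ is decreasing; for the increasing function used here it has no effect.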



For example, here is the proof of Proposition 3.6: if $X \sim N(\mu,\sigma^2)$ and
$Y = (X - \mu)/\sigma$, then $Y \sim N(0,1)$. Recall that

\[ f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}. \]

We have $Y = g(X)$, where $g(x) = (x - \mu)/\sigma$; this function is everywhere strictly
increasing (the graph is a straight line with slope $1/\sigma$), and the inverse function is
$x = h(y) = \sigma y + \mu$. Thus, $h'(y) = \sigma$, and

\[ f_Y(y) = f_X(\sigma y + \mu) \cdot \sigma = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}, \]

the p.d.f. of a standard normal variable.

However, rather than remember this formula, together with the conditions for 
its validity, I recommend going back to the argument we used in the example. 

If the transforming function $g$ is not monotonic (that is, not either increasing
or decreasing), then life is a bit more complicated. For example, if $X$ is a random
variable taking both positive and negative values, and $Y = X^2$, then a given value
$y$ of $Y$ could arise from either of the values $\sqrt{y}$ and $-\sqrt{y}$ of $X$, so we must work
out the two contributions and add them up.



Example $X \sim N(0,1)$ and $Y = X^2$. Find the p.d.f. of $Y$.

The p.d.f. of $X$ is $(1/\sqrt{2\pi})\,e^{-x^2/2}$. Let $\Phi(x)$ be its c.d.f., so that
$P(X \le x) = \Phi(x)$, and

\[ \Phi'(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. \]

Now $Y = X^2$, so $Y \le y$ if and only if $-\sqrt{y} \le X \le \sqrt{y}$. Thus

\begin{align*}
F_Y(y) &= P(Y \le y) \\
       &= P(-\sqrt{y} \le X \le \sqrt{y}) \\
       &= \Phi(\sqrt{y}) - \Phi(-\sqrt{y}) \\
       &= \Phi(\sqrt{y}) - (1 - \Phi(\sqrt{y})) && \text{(by symmetry of $N(0,1)$)} \\
       &= 2\Phi(\sqrt{y}) - 1.
\end{align*}

So

\begin{align*}
f_Y(y) &= \frac{d}{dy} F_Y(y) \\
       &= 2\Phi'(\sqrt{y}) \cdot \frac{1}{2\sqrt{y}} && \text{(by the Chain Rule)} \\
       &= \frac{1}{\sqrt{2\pi y}}\, e^{-y/2}.
\end{align*}

Of course, this is valid for $y > 0$; for $y < 0$, the p.d.f. is zero.

Note the 2 in the line labelled "by the Chain Rule". If you blindly applied
the formula of Theorem 4.5, using $h(y) = \sqrt{y}$, you would not get this 2; it arises
from the fact that, since $Y = X^2$, each value of $Y$ corresponds to two values of $X$
(one positive, one negative), and each value gives the same contribution, by the
symmetry of the p.d.f. of $X$.
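
The result can be checked numerically by differentiating the c.d.f. $F_Y(y) = 2\Phi(\sqrt{y}) - 1$. The sketch below is one way to do this in Python; it assumes the standard identity expressing $\Phi$ through the error function, and the helper names are illustrative.

```python
import math

def phi_cdf(x):
    """Standard normal c.d.f., written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cdf_y(y):
    """F_Y(y) = 2 * Phi(sqrt(y)) - 1 for y > 0, where Y = X**2."""
    return 2.0 * phi_cdf(math.sqrt(y)) - 1.0

def pdf_y(y):
    """The p.d.f. derived in the text: exp(-y/2) / sqrt(2*pi*y) for y > 0."""
    return math.exp(-y / 2.0) / math.sqrt(2.0 * math.pi * y)

def numeric_derivative(func, y, eps=1e-5):
    """Central-difference estimate of the derivative of func at y."""
    return (func(y + eps) - func(y - eps)) / (2.0 * eps)

for y in (0.5, 1.0, 2.0):
    print(y, numeric_derivative(cdf_y, y), pdf_y(y))
```

The two columns should agree closely. A formula without the factor 2 would give exactly half of each value.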

4.5 Worked examples 

Question Two numbers X and Y are chosen independently from the uniform 
distribution on the unit interval [0, 1]. Let Z be the maximum of the two numbers. 
Find the p.d.f. of Z, and hence find its expected value, variance and median. 

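Solution The c.d.f.s of $X$ and $Y$ are identical, that is,

\[ F_X(x) = F_Y(x) = \begin{cases} 0 & \text{if } x < 0, \\ x & \text{if } 0 \le x \le 1, \\ 1 & \text{if } x > 1. \end{cases} \]

The key observation is that

\[ Z = \max(X,Y) \le x \text{ if and only if } X \le x \text{ and } Y \le x. \]

(For, if both $X$ and $Y$ are smaller than a given value $x$, then so is their maximum;
but if at least one of them is greater than $x$, then again so is their maximum.) For
$0 \le x \le 1$, we have $P(X \le x) = P(Y \le x) = x$; by independence,

\[ P(X \le x \text{ and } Y \le x) = x \cdot x = x^2. \]

Thus $P(Z \le x) = x^2$. Of course this probability is 0 if $x < 0$ and is 1 if $x > 1$. So
the c.d.f. of $Z$ is

\[ F_Z(x) = \begin{cases} 0 & \text{if } x < 0, \\ x^2 & \text{if } 0 \le x \le 1, \\ 1 & \text{if } x > 1. \end{cases} \]

The median of $Z$ is the value of $m$ such that $F_Z(m) = 1/2$, that is $m^2 = 1/2$, or
$m = 1/\sqrt{2}$.

We obtain the p.d.f. of $Z$ by differentiating:

\[ f_Z(x) = \begin{cases} 2x & \text{if } 0 < x < 1, \\ 0 & \text{otherwise.} \end{cases} \]

Then

\[ E(Z) = \int_0^1 2x^2\,dx = \frac{2}{3}, \qquad \mathrm{Var}(Z) = \int_0^1 2x^3\,dx - \Bigl(\frac{2}{3}\Bigr)^2 = \frac{1}{2} - \frac{4}{9} = \frac{1}{18}. \]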
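
Once the c.d.f. $F_Z(x) = x^2$ is known, the remaining quantities are routine integrals. The following Python sketch approximates them with a midpoint rule, using the p.d.f. $f_Z(x) = 2x$ obtained by differentiating $F_Z$; the function names and the number of subintervals are illustrative assumptions.

```python
def integrate(func, a, b, n=100_000):
    """Midpoint-rule approximation to the integral of func over [a, b]."""
    h = (b - a) / n
    return sum(func(a + (i + 0.5) * h) for i in range(n)) * h

def pdf_z(x):
    """p.d.f. of Z = max(X, Y): the derivative of F_Z(x) = x**2 on [0, 1]."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

mean_z = integrate(lambda x: x * pdf_z(x), 0.0, 1.0)                  # E(Z)
var_z = integrate(lambda x: x * x * pdf_z(x), 0.0, 1.0) - mean_z ** 2  # E(Z^2) - E(Z)^2
median_z = 0.5 ** 0.5                                                  # solves m**2 = 1/2
print(mean_z, var_z, median_z)
```

The printed values should be close to $2/3$, $1/18$ and $1/\sqrt{2}$ respectively.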









Question I roll a fair die bearing the numbers 1 to 6. If $N$ is the number showing
on the die, I then toss a fair coin $N$ times. Let $X$ be the number of heads I obtain.

(a) Write down the p.m.f. for X. 

(b) Calculate E(X) without using this information. 



Solution (a) If we were given that $N = n$, say, then $X$ would be a binomial
$\mathrm{Bin}(n, 1/2)$ random variable. So $P(X = k \mid N = n) = {}^nC_k (1/2)^n$.
By the ToTP,

\[ P(X = k) = \sum_{n=1}^{6} P(X = k \mid N = n) P(N = n). \]

Clearly $P(N = n) = 1/6$ for $n = 1, \ldots, 6$. So to find $P(X = k)$, we add up the
probability that $X = k$ for a $\mathrm{Bin}(n, 1/2)$ r.v. for $n = k, \ldots, 6$ (or $n = 1, \ldots, 6$
when $k = 0$) and divide by 6. (We start at $k$ because you can't get $k$ heads with fewer
than $k$ coin tosses!) The answer comes to



    k             0        1        2        3        4        5        6
    P(X = k)   63/384  120/384   99/384   64/384   29/384    8/384    1/384



For example,

\[ P(X = 4) = \frac{{}^4C_4 (1/2)^4 + {}^5C_4 (1/2)^5 + {}^6C_4 (1/2)^6}{6} = \frac{4 + 10 + 15}{384} = \frac{29}{384}. \]

(b) By Proposition 4.3,

\[ E(X) = \sum_{n=1}^{6} E(X \mid (N = n)) P(N = n). \]

Now if we are given that $N = n$ then, as we remarked, $X$ has a binomial $\mathrm{Bin}(n, 1/2)$
distribution, with expected value $n/2$. So

\[ E(X) = \sum_{n=1}^{6} (n/2) \cdot (1/6) = \frac{1 + 2 + 3 + 4 + 5 + 6}{2 \cdot 6} = \frac{7}{4}. \]

Try working it out from the p.m.f. to check that the answer is the same.
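
The suggested check can also be done by machine. Here is a small Python sketch, using exact fractions, that builds the p.m.f. from the Theorem of Total Probability and computes $E(X)$ both ways; the name `pmf_x` is illustrative.

```python
from fractions import Fraction
from math import comb

def pmf_x(k):
    """P(X = k) = sum_n P(X = k | N = n) P(N = n), with N uniform on 1..6."""
    # comb(n, k) is 0 when k > n, so terms with fewer than k tosses drop out
    return sum(Fraction(comb(n, k), 2 ** n) * Fraction(1, 6) for n in range(1, 7))

pmf = {k: pmf_x(k) for k in range(7)}
mean_from_pmf = sum(k * p for k, p in pmf.items())
mean_from_conditioning = sum(Fraction(n, 2) * Fraction(1, 6) for n in range(1, 7))
print([str(pmf[k] * 384) for k in range(7)], mean_from_pmf, mean_from_conditioning)
```

The list printed first gives the numerators over 384, which can be compared with the table in part (a).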
Mathematical notation



The Greek alphabet

    Name       Capital   Lowercase
    alpha      Α         α
    beta       Β         β
    gamma      Γ         γ
    delta      Δ         δ
    epsilon    Ε         ε
    zeta       Ζ         ζ
    eta        Η         η
    theta      Θ         θ
    iota       Ι         ι
    kappa      Κ         κ
    lambda     Λ         λ
    mu         Μ         μ
    nu         Ν         ν
    xi         Ξ         ξ
    omicron    Ο         ο
    pi         Π         π
    rho        Ρ         ρ
    sigma      Σ         σ
    tau        Τ         τ
    upsilon    Υ         υ
    phi        Φ         φ
    chi        Χ         χ
    psi        Ψ         ψ
    omega      Ω         ω

Mathematicians use the Greek alphabet for an extra supply of symbols.
Some, like π, have standard meanings. You don't need to learn this; keep it
for reference. Apologies to Greek students: you may not recognise this, but
it is the Greek alphabet that mathematicians use!

Pairs that are often confused are zeta and xi, or nu and upsilon, which look
alike; and chi and xi, or epsilon and upsilon, which sound alike.

Numbers



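    Notation                  Meaning                                   Example
    $\mathbb{N}$              Natural numbers                           1, 2, 3, ... (some people include 0)
    $\mathbb{Z}$              Integers                                  ..., -2, -1, 0, 1, 2, ...
    $\mathbb{R}$              Real numbers                              $\tfrac{1}{2}, \sqrt{2}, \pi, \ldots$
    $|x|$                     modulus                                   $|2| = 2$, $|-3| = 3$
    $a/b$ or $\frac{a}{b}$    a over b                                  12/3 = 4, 2/4 = 0.5
    $a \mid b$                a divides b                               $4 \mid 12$
    ${}^mC_n$                 m choose n                                ${}^5C_2 = 10$
    $n!$                      n factorial                               5! = 120
    $\sum_{i=a}^{b} x_i$      $x_a + x_{a+1} + \cdots + x_b$            $\sum_{i=1}^{3} i^2 = 1^2 + 2^2 + 3^2 = 14$
                              (see section on Summation below)
    $x \approx y$             x is approximately equal to y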



Sets 



    Notation                       Meaning                           Example
    $\{\ldots\}$                   a set                             $\{1,2,3\}$
                                                                     NOTE: $\{1,2\} = \{2,1\}$
    $x \in A$                      x is an element of the set A      $2 \in \{1,2,3\}$
    $\{x : \ldots\}$               the set of all x such that ...    $\{x : x^2 = 4\} = \{-2, 2\}$
    or $\{x \mid \ldots\}$
    $|A|$                          cardinality of A                  $|\{1,2,3\}| = 3$
                                   (number of elements in A)
    $A \cup B$                     A union B                         $\{1,2,3\} \cup \{2,4\} = \{1,2,3,4\}$
                                   (elements in either A or B)
    $A \cap B$                     A intersection B                  $\{1,2,3\} \cap \{2,4\} = \{2\}$
                                   (elements in both A and B)
    $A \setminus B$                set difference                    $\{1,2,3\} \setminus \{2,4\} = \{1,3\}$
                                   (elements in A but not B)
    $A \subseteq B$                A is a subset of B (or equal)     $\{1,3\} \subseteq \{1,2,3\}$
    $A'$                           complement of A                   everything not in A
    $\emptyset$                    empty set (no elements)           $\{1,2\} \cap \{3,4\} = \emptyset$
    $(x,y)$                        ordered pair                      NOTE: $(1,2) \ne (2,1)$
    $A \times B$                   Cartesian product                 $\{1,2\} \times \{1,3\} =$
                                   (set of all ordered pairs)        $\{(1,1),(2,1),(1,3),(2,3)\}$

Summation

What is it? 

Let $a_1, a_2, a_3, \ldots$ be numbers. The notation

\[ \sum_{i=1}^{n} a_i \]

(read "sum, from $i$ equals 1 to $n$, of $a_i$"), means: add up the numbers $a_1, a_2, \ldots, a_n$;
that is,

\[ \sum_{i=1}^{n} a_i = a_1 + a_2 + \cdots + a_n. \]

The notation $\sum_{j=1}^{n} a_j$ means exactly the same thing. The variable $i$ or $j$ is called
a "dummy variable".

The notation $\sum_{i=1}^{m} a_i$ is not the same, since (if $m$ and $n$ are different) it is telling
us to add up a different number of terms.

The sum doesn't have to start at 1. For example,

\[ \sum_{i=10}^{20} a_i = a_{10} + a_{11} + \cdots + a_{20}. \]

Sometimes I get lazy and don't bother to write out the values: I just say $\sum_i a_i$
to mean: add up all the relevant values. For example, if $X$ is a discrete random
variable, then we say that

\[ E(X) = \sum_i a_i P(X = a_i) \]

where the sum is over all $i$ such that $a_i$ is a value of the random variable $X$.

Manipulation 

The following three rules hold.

\[ \sum_{i=1}^{n} (a_i + b_i) = \sum_{i=1}^{n} a_i + \sum_{i=1}^{n} b_i. \tag{A.1} \]

Imagine the $a$s and $b$s written out with $a_1 + b_1$ on the first line, $a_2 + b_2$ on the
second line, and so on. The left-hand side says: add the two terms in each line,
and then add up all the results. The right-hand side says: add the first column (all
the $a$s) and the second column (all the $b$s), and then add the results. The answers
must be the same.

\[ \Bigl( \sum_{i=1}^{n} a_i \Bigr) \Bigl( \sum_{j=1}^{m} b_j \Bigr) = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j. \tag{A.2} \]

The double sum says add up all these products, for all values of $i$ and $j$. A simple
example shows how it works: $(a_1 + a_2)(b_1 + b_2) = a_1b_1 + a_1b_2 + a_2b_1 + a_2b_2$.

If in place of numbers, we have functions of $x$, then we can "differentiate
term-by-term":

\[ \frac{d}{dx} \sum_{i=1}^{n} f_i(x) = \sum_{i=1}^{n} \frac{d}{dx} f_i(x). \]

The left-hand side says: add up the functions and differentiate the sum. The right
says: differentiate each function and add up the derivatives.

Another useful result is the Binomial Theorem:

\[ (x + y)^n = \sum_{k=0}^{n} {}^nC_k\, x^{n-k} y^k. \]
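
Rules (A.1) and (A.2) and the Binomial Theorem can be spot-checked on small lists of numbers. The following Python sketch does so; the particular lists and the helper name `binomial_expansion` are arbitrary choices made for illustration.

```python
from math import comb

a = [2, -1, 5]
b = [3, 4, 0, 7]

# (A.1): summing a_i + b_i term by term equals summing each list separately
a3, b3 = a, b[:3]
lhs_a1 = sum(x + y for x, y in zip(a3, b3))
rhs_a1 = sum(a3) + sum(b3)

# (A.2): a product of two sums equals the double sum of all products a_i * b_j
lhs_a2 = sum(a) * sum(b)
rhs_a2 = sum(x * y for x in a for y in b)

def binomial_expansion(x, y, n):
    """Right-hand side of the Binomial Theorem: sum_k C(n, k) x**(n-k) y**k."""
    return sum(comb(n, k) * x ** (n - k) * y ** k for k in range(n + 1))

print(lhs_a1 == rhs_a1, lhs_a2 == rhs_a2, binomial_expansion(2, 3, 5) == (2 + 3) ** 5)
```

Note that in (A.2) the two lists may have different lengths, whereas (A.1) needs the same number of terms on both sides.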



Infinite sums 



Sometimes we meet infinite sums, which we write as $\sum_{i=1}^{\infty} a_i$ for example. This
doesn't just mean "add up infinitely many values", since that is not possible. We
need Analysis to give us a definition in general. But sometimes we know the
answer another way: for example, if $a_i = ar^{i-1}$, where $-1 < r < 1$, then

\[ \sum_{i=1}^{\infty} a_i = a + ar + ar^2 + \cdots = \frac{a}{1 - r}, \]

using the formula for the sum of the "geometric series". You also need to know
the sum of the "exponential series"

\[ e^x = \sum_{i=0}^{\infty} \frac{x^i}{i!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \cdots. \]

Do the three rules of the preceding section hold? Sometimes yes, sometimes 
no. In Analysis you will see some answers to this question. 

In all the examples you meet in this book, the rules will be valid. 
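
To get a feel for what an infinite sum means, it helps to watch the partial sums settle down. The Python sketch below does this for the geometric and exponential series; the values of $a$, $r$ and $x$ are arbitrary illustrative choices.

```python
import math

def geometric_partial_sum(a, r, n):
    """a + a*r + ... + a*r**(n-1), the first n terms of the geometric series."""
    return sum(a * r ** i for i in range(n))

def exponential_partial_sum(x, n):
    """Sum of x**i / i! for i = 0, ..., n-1."""
    return sum(x ** i / math.factorial(i) for i in range(n))

a, r, x = 3.0, 0.5, 1.5
for n in (5, 10, 20, 40):
    print(n, geometric_partial_sum(a, r, n), exponential_partial_sum(x, n))
print("limits:", a / (1 - r), math.exp(x))
```

As $n$ grows, the two columns approach $a/(1-r)$ and $e^x$ respectively.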



Appendix B 

Probability and random variables 



Notation 

In the table, $A$ and $B$ are events, $X$ and $Y$ are random variables.

    Notation                         Meaning                                  Page
    $P(A)$                           probability of A                         3
    $P(A \mid B)$                    conditional probability of A given B     24
    $X = Y$                          the values of X and Y are equal
    $X \sim Y$                       X and Y have the same distribution       41
                                     (that is, same p.m.f. or same p.d.f.)
    $E(X)$                           expected value of X                      41
    $\mathrm{Var}(X)$                variance of X                            42
    $\mathrm{Cov}(X,Y)$              covariance of X and Y                    67
    $\mathrm{corr}(X,Y)$             correlation coefficient of X and Y       68
    $X \mid B$, $X \mid (Y = b)$     conditional random variable              70

Bernoulli random variable $\mathrm{Bernoulli}(p)$ (p. 48)

• Occurs when there is a single trial with a fixed probability p of success. 

• Takes only the values 0 and 1.

• p.m.f. $P(X = 0) = q$, $P(X = 1) = p$, where $q = 1 - p$.

• $E(X) = p$, $\mathrm{Var}(X) = pq$.

Binomial random variable $\mathrm{Bin}(n,p)$ (p. 49)

• Occurs when we are counting the number of successes in $n$ independent
trials with fixed probability $p$ of success in each trial, e.g. the number of
heads in $n$ coin tosses. Also, sampling with replacement from a population
with a proportion $p$ of distinguished elements.

• The sum of $n$ independent $\mathrm{Bernoulli}(p)$ random variables.

• Values $0, 1, 2, \ldots, n$.

• p.m.f. $P(X = k) = {}^nC_k\, q^{n-k} p^k$ for $0 \le k \le n$, where $q = 1 - p$.

• $E(X) = np$, $\mathrm{Var}(X) = npq$.

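Hypergeometric random variable $\mathrm{Hg}(n,M,N)$ (p. 51)

• Occurs when we are sampling $n$ elements without replacement from a
population of $N$ elements of which $M$ are distinguished.

• Values $0, 1, 2, \ldots, n$.

• p.m.f. $P(X = k) = ({}^MC_k \cdot {}^{N-M}C_{n-k}) / {}^NC_n$.

• $E(X) = n\left(\frac{M}{N}\right)$, $\mathrm{Var}(X) = n\left(\frac{M}{N}\right)\left(\frac{N-M}{N}\right)\left(\frac{N-n}{N-1}\right)$.

• Approximately $\mathrm{Bin}(n, M/N)$ if $n$ is small compared to $N$, $M$, $N - M$.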

Geometric random variable Geom(p) (p. 52) 

• Describes the number of trials up to and including the first success in a 
sequence of independent Bernoulli trials, e.g. number of tosses until the 
first head when tossing a coin. 

• Values 1,2,... (any positive integer). 

• p.m.f. $P(X = k) = q^{k-1} p$, where $q = 1 - p$.

• $E(X) = 1/p$, $\mathrm{Var}(X) = q/p^2$.

Poisson random variable $\mathrm{Poisson}(\lambda)$ (p. 54)

• Describes the number of occurrences of a random event in a fixed time
interval, e.g. the number of fish caught in a day.

• Values $0, 1, 2, \ldots$ (any non-negative integer).

• p.m.f. $P(X = k) = e^{-\lambda} \lambda^k / k!$.

• $E(X) = \lambda$, $\mathrm{Var}(X) = \lambda$.

• If $n$ is large, $p$ is small, and $np = \lambda$, then $\mathrm{Bin}(n,p)$ is approximately equal
to $\mathrm{Poisson}(\lambda)$ (in the sense that the p.m.f.s are approximately equal).
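
The quality of the Poisson approximation to the binomial can be inspected directly. The Python sketch below compares the two p.m.f.s for one illustrative choice of parameters ($n = 1000$, $p = 0.003$, values assumed here and not taken from the notes).

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam ** k / factorial(k)

n, p = 1000, 0.003          # illustrative values: n large, p small
lam = n * p
# largest pointwise difference between the two p.m.f.s for k = 0, ..., 20
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(21))
print(max_gap)
```

Repeating the comparison with a larger $p$ (keeping $np$ fixed by shrinking $n$) shows the gap widening, which is why the approximation needs $p$ to be small.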

Uniform random variable U[a,b] (p. 58) 

• Occurs when a number is chosen at random from the interval [a,b], with all 
values equally likely. 

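• p.d.f. $f(x) = \begin{cases} 0 & \text{if } x < a, \\ 1/(b-a) & \text{if } a \le x \le b, \\ 0 & \text{if } x > b. \end{cases}$

• c.d.f. $F(x) = \begin{cases} 0 & \text{if } x < a, \\ (x-a)/(b-a) & \text{if } a \le x \le b, \\ 1 & \text{if } x > b. \end{cases}$

• $E(X) = (a + b)/2$, $\mathrm{Var}(X) = (b - a)^2/12$.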

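Exponential random variable $\mathrm{Exp}(\lambda)$ (p. 59)

• Occurs in the same situations as the Poisson random variable, but measures
the time from now until the first occurrence of the event.

• p.d.f. $f(x) = \begin{cases} 0 & \text{if } x < 0, \\ \lambda e^{-\lambda x} & \text{if } x \ge 0. \end{cases}$

• c.d.f. $F(x) = \begin{cases} 0 & \text{if } x < 0, \\ 1 - e^{-\lambda x} & \text{if } x \ge 0. \end{cases}$

• $E(X) = 1/\lambda$, $\mathrm{Var}(X) = 1/\lambda^2$.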

• However long you wait, the time until the next occurrence has the same
distribution.



Normal random variable $N(\mu,\sigma^2)$ (p. 59)

• The limit of the sum (or average) of many independent Bernoulli random
variables. This also works for many other types of random variables: this
statement is known as the Central Limit Theorem.

• p.d.f. $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}$.

• No simple formula for c.d.f.; use tables.

• $E(X) = \mu$, $\mathrm{Var}(X) = \sigma^2$.

• For large $n$, $\mathrm{Bin}(n,p)$ is approximately $N(np, npq)$.

• Standard normal $N(0,1)$ is given in the table. If $X \sim N(\mu,\sigma^2)$, then
$(X - \mu)/\sigma \sim N(0,1)$.

The c.d.f.s of the Binomial, Poisson, and Standard Normal random variables
are tabulated in the New Cambridge Statistical Tables, Tables 1, 2 and 4.



